[Undergraduate Project Practicum] P-Tuning v2 Test Log

Test Script

PRE_SEQ_LEN=64
CHECKPOINT=dsbtpg-chatglm-6b-pt-64-2e-2
STEP=500

CUDA_VISIBLE_DEVICES=0 python3 main.py \
    --do_predict \
    --validation_file devVX.json \
    --test_file devVX.json \
    --overwrite_cache \
    --prompt_column content \
    --response_column summary \
    --model_name_or_path /home/lyc/workspace/ChatGLM-6B/chatglm-6b \
    --ptuning_checkpoint ./output/$CHECKPOINT/checkpoint-$STEP \
    --output_dir ./output/$CHECKPOINT \
    --overwrite_output_dir \
    --max_source_length 64 \
    --max_target_length 64 \
    --per_device_eval_batch_size 1 \
    --predict_with_generate \
    --pre_seq_len $PRE_SEQ_LEN \
    --quantization_bit 8

Test Run

 99%|████████████████████████████████████████████████████████████████████████████████▍| 139/140 [01:50<00:00,  1.27it/s][INFO|configuration_utils.py:575] 2024-05-21 13:41:44,210 >> Generate config GenerationConfig {
  "_from_model_config": true,
  "bos_token_id": 130004,
  "eos_token_id": 130005,
  "pad_token_id": 3,
  "transformers_version": "4.27.1"
}

100%|█████████████████████████████████████████████████████████████████████████████████| 140/140 [01:51<00:00,  1.27it/s]Building prefix dict from the default dictionary ...
05/21/2024 13:41:45 - DEBUG - jieba - Building prefix dict from the default dictionary ...
Dumping model to file cache /tmp/jieba.cache
05/21/2024 13:41:45 - DEBUG - jieba - Dumping model to file cache /tmp/jieba.cache
Loading model cost 0.680 seconds.
05/21/2024 13:41:45 - DEBUG - jieba - Loading model cost 0.680 seconds.
Prefix dict has been built successfully.
05/21/2024 13:41:45 - DEBUG - jieba - Prefix dict has been built successfully.
100%|█████████████████████████████████████████████████████████████████████████████████| 140/140 [01:51<00:00,  1.25it/s]
***** predict metrics *****
  predict_bleu-4             =    76.3107
  predict_rouge-1            =    83.1915
  predict_rouge-2            =    77.6409
  predict_rouge-l            =    91.1686
  predict_runtime            = 0:01:53.47
  predict_samples            =        140
  predict_samples_per_second =      1.234
  predict_steps_per_second   =      1.234

The evaluate routine in main.py scores the fine-tuned model on the specified dataset using two classic metrics, BLEU and ROUGE. BLEU measures similarity by the n-gram overlap between the model output and the human reference text, while ROUGE evaluates summary quality by the recall of words or phrases from the reference. For both metrics, higher scores indicate better model quality.
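As a rough illustration of what these metrics measure, here is a minimal pure-Python sketch of sentence-level BLEU-4 (with add-one smoothing) and ROUGE-L (LCS-based F1) on token lists. This is a simplified stand-in, not the actual scoring code in main.py, which tokenizes Chinese text with jieba and delegates to NLTK and rouge-chinese; the function names and smoothing choice below are my own.

```python
# Illustrative reimplementations of BLEU-4 and ROUGE-L on token lists.
# Simplified sketch only; main.py's real metrics use jieba + NLTK/rouge-chinese.
import math
from collections import Counter


def bleu4(candidate, reference):
    """Sentence-level BLEU-4 with add-one smoothing and brevity penalty."""
    precisions = []
    for n in range(1, 5):
        cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
        ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
        overlap = sum((cand & ref).values())          # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append((overlap + 1) / (total + 1))  # add-one smoothing
    log_avg = sum(math.log(p) for p in precisions) / 4  # geometric mean of p1..p4
    if len(candidate) > len(reference):
        bp = 1.0                                      # no brevity penalty
    else:
        bp = math.exp(1 - len(reference) / max(len(candidate), 1))
    return bp * math.exp(log_avg)


def rouge_l(candidate, reference):
    """ROUGE-L F1 via longest common subsequence (dynamic programming)."""
    m, n = len(candidate), len(reference)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if candidate[i] == reference[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[m][n]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / m, lcs / n
    return 2 * precision * recall / (precision + recall)
```

With a prediction identical to the reference, both functions return 1.0, matching the intuition that the near-100 scores in the rerun below mean the model reproduces the reference answers almost verbatim.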

Note: because of a problem during training, the model was retrained; the numbers above are kept only to show the output format. The results after rerunning were:

***** predict metrics *****
  predict_bleu-4             =    99.3069
  predict_rouge-1            =     99.449
  predict_rouge-2            =    99.3863
  predict_rouge-l            =    99.7142
  predict_runtime            = 0:02:19.46
  predict_samples            =        168
  predict_samples_per_second =      1.205
  predict_steps_per_second   =      1.205


posted @ 2024-06-24 12:41  yicheng_liu0219