Common Problems and Tuning Tips for NLP Model Development with MindSpore

This post shares some of the odd problems I ran into while developing NLP models with MindSpore. If you have hit similar issues, or have better solutions, feel free to share and discuss.

1. The dataset batch problem

When the dataset is divided into batches and the last batch contains fewer samples than batch_size, an exception occurs while the model loads data during training.
The error output is as follows:
epoch: 0, current epoch percent: 0.987, step: 154, outputs are (Tensor(shape=[], dtype=Float32, value= 0.15773), Tensor(shape=[], dtype=Bool, value= False))
epoch: 0, current epoch percent: 0.994, step: 155, outputs are (Tensor(shape=[], dtype=Float32, value= 0.016068), Tensor(shape=[], dtype=Bool, value= False))
[ERROR] RUNTIME(3547)kernel task happen error, retCode=0x26, [aicore exception].
[ERROR] RUNTIME(3547)aicore kernel execute failed, device_id=0, stream_id=512, task_id=16182, fault kernel_name=Gather_16602455126779674102_0__kernel0, func_name=Gather_16602455126779674102_0__kernel0
[ERROR] DEVICE(3537,python3):2021-03-09-06:21:49.246.617 [mindspore/ccsrc/runtime/device/ascend/ascend_kernel_runtime.cc:717] SyncStream] Call runtime rtStreamSynchronize error.
[ERROR] DEVICE(3537,python3):2021-03-09-06:21:49.246.670 [mindspore/ccsrc/runtime/device/ascend/ascend_device_address.cc:279] SyncStream] Sync stream error!
Traceback (most recent call last):
  File "main.py", line 220, in <module>
  File "main.py", line 187, in main
    do_train(ds, netwithloss, load_pretrain_checkpoint_path, save_finetune_checkpoint_path, epoch_num)
  File "main.py", line 138, in do_train
    model.train(epoch_num, dataset, callbacks=callbacks)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/train/model.py", line 592, in train
    sink_size=sink_size)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/train/model.py", line 391, in _train
    self._train_dataset_sink_process(epoch, train_dataset, list_callback, cb_params, sink_size)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/train/model.py", line 452, in _train_dataset_sink_process
    outputs = self._train_network(*inputs)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py", line 351, in __call__
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/train/dataset_helper.py", line 87, in construct
    return self.network(*outputs)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py", line 351, in __call__
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/ma-user/work/bert_for_finetune.py", line 117, in construct
    label_ids)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py", line 351, in __call__
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/ma-user/work/bert_for_finetune.py", line 269, in construct
    logits = self.bert(input_ids, input_mask, token_type_id)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py", line 351, in __call__
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/ma-user/work/finetune_eval_model.py", line 50, in construct
    self.bert(input_ids, token_type_id, input_mask)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py", line 351, in __call__
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/ma-user/work/bert_model.py", line 852, in construct
    word_embeddings)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py", line 351, in __call__
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/ma-user/work/bert_model.py", line 203, in construct
    output = self.dropout(output)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/cell.py", line 351, in __call__
    output = self.construct(*cast_inputs, **kwargs)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/nn/layer/basic.py", line 157, in construct
    out, _ = self.dropout(x)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 186, in __call__
    return _run_op(self, self.name, args)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/common/api.py", line 75, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/ma-user/miniconda3/envs/Mindspore-1.1.1-python3.7-aarch64/lib/python3.7/site-packages/mindspore/ops/primitive.py", line 525, in _run_op
    output = real_run_op(obj, op_name, args)
RuntimeError: mindspore/ccsrc/runtime/device/ascend/ascend_device_address.cc:279 SyncStream] Sync stream error!

 

I have not been able to pin down the root cause of this exception. The current workaround is to drop the last batch that is smaller than batch_size, i.e. set drop_remainder to True:

data_set.batch(batch_size, drop_remainder=True)
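
For context, here is a minimal sketch of where drop_remainder fits into the data pipeline. The file name, column list and batch size below are placeholders for illustration, not the actual project configuration:

import mindspore.dataset as ds

# Hypothetical MindRecord file and column names, for illustration only.
data_set = ds.MindDataset(dataset_file="udc_train.mindrecord",
                          columns_list=["input_ids", "input_mask", "segment_ids", "label_ids"])
data_set = data_set.shuffle(buffer_size=10000)
# Dropping the incomplete final batch keeps every batch at exactly batch_size samples,
# which avoids the aicore exception shown above.
data_set = data_set.batch(32, drop_remainder=True)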

 

 

2. The 16-alignment requirement arising from the Ascend chip's matrix-compute design

The model I was developing mainly fine-tunes BERT for classification tasks. One of the datasets is the UDC dataset, for which the original paper uses max_seq_len = 210. Since Ascend favours 16-aligned shapes, I compared training and inference of the UDC dataset on the Ascend 910 with different max_seq_len values, as shown in the table below.

[Table: training and inference performance of the UDC dataset with max_seq_len = 210, 224 and 256 on Ascend 910; the original table is not reproduced here.]

The results show that 224 and 256 are multiples of 16 while 210 is not. The comparison of training and inference performance makes it clear that a 16-aligned max_seq_len gives a large performance improvement on the Ascend 910, for both training and inference, compared with the unaligned value. If you run into a similar situation, it is worth checking whether 16-alignment is involved.

 

224 gives good performance and accuracy, so it was adopted as the max_seq_len for training. Developing the model with MindSpore on the Ascend 910, both performance and accuracy improved over the baseline reported in the paper (a Paddle implementation).
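
For readers who want to apply the same alignment, the helpers below are only an illustrative sketch (not code from the project) of rounding max_seq_len up to a multiple of 16 and padding each example to that length:

def align_to_16(length):
    # Round a sequence length up to the next multiple of 16, e.g. 210 -> 224.
    return (length + 15) // 16 * 16

def pad_to_len(token_ids, max_seq_len, pad_id=0):
    # Truncate or right-pad a list of token ids to exactly max_seq_len.
    token_ids = token_ids[:max_seq_len]
    return token_ids + [pad_id] * (max_seq_len - len(token_ids))

max_seq_len = align_to_16(210)  # 224, the value used for training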

 

3. MindRecord files written by FileWriter cannot be loaded after being renamed

Steps to reproduce the problem:

  1. Write the mindrecord file with FileWriter
  2. Rename the mindrecord file
  3. Load the renamed mindrecord file during model training

Error message:

[ERROR] MD(151,python):2021-03-24-22:41:54.507.733 [mindspore/ccsrc/minddata/mindrecord/io/shard_reader.cc:126] Init] Invalid file, DB file can not match file: /cache/data/udc/udc_train.mindrecord
[ERROR] CORE(151,python):2021-03-24-22:41:54.507.824 [mindspore/core/utils/status.cc:43] Status] Thread ID 281473654276112 Unexpected error. Invalid data, MindRecordOp failed to count total rows.
Line of code : 455
File : /home/jenkins/agent-working-dir/workspace/Compile_Ascend_ARM_EulerOS/mindspore/mindspore/ccsrc/minddata/dataset/engine/datasetops/source/mindrecord_op.cc
Traceback (most recent call last):
File "/home/work/user-job-dir/push/run_dgu.py", line 233, in
run_dgu(args_opt)
File "/home/work/user-job-dir/push/run_dgu.py", line 154, in run_dgu
do_train(train_ds, netwithloss, load_pretrain_checkpoint_path, save_finetune_checkpoint_path, epoch_num)
File "/home/work/user-job-dir/push/run_dgu.py", line 54, in do_train
num_examples = dataset.get_dataset_size() * args_opt.train_batch_size
File "/usr/local/ma/python3.7/lib/python3.7/site-packages/mindspore/dataset/engine/datasets.py", line 1456, in get_dataset_size
self.dataset_size = runtime_getter[0].GetDatasetSize(False)
RuntimeError: Thread ID 281473654276112 Unexpected error. Invalid data, MindRecordOp failed to count total rows.

After filing an issue, I received the following feedback:

This is due to the design of MindRecord:

  1. MindRecord stores metadata internally, including the file name (file name, schema, page size, etc.). The training samples live in the data file (e.g. train.mindrecord), while the accompanying *.db file (e.g. train.mindrecord.db) is its index file; the two correspond one-to-one. Renaming the files breaks this correspondence, so the index file can no longer be found.
    train.mindrecord train.mindrecord.db
  2. FileWriter also supports automatically splitting the dataset into multiple files by specifying shard_num=N, for example four mindrecord files:
    train0.mindrecord train0.mindrecord.db train1.mindrecord train1.mindrecord.db
    train2.mindrecord train2.mindrecord.db train3.mindrecord train3.mindrecord.db
    The file names train0.mindrecord, train1.mindrecord, train2.mindrecord and train3.mindrecord are stored in the metadata, indicating that the four files belong to the same dataset. When MindDataset loads one of them, all four are loaded together; if any file name is changed, the correspondence between the four mindrecord files is lost and loading fails as well.

Issue link:

https://gitee.com/mindspore/mindspore/issues/I3D4BQ
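
In short, write the MindRecord under its final name and load it with exactly that name, without renaming either the data file or the .db index file. The sketch below is only an illustration: the file name, schema and the samples variable are hypothetical, and the MindDataset parameter name (dataset_file) is the one used in MindSpore 1.x:

from mindspore.mindrecord import FileWriter
import mindspore.dataset as ds

# Write: the file name passed here is recorded in the metadata, and a matching
# udc_train.mindrecord.db index file is created next to the data file.
writer = FileWriter(file_name="udc_train.mindrecord", shard_num=1)
writer.add_schema({"input_ids": {"type": "int64", "shape": [-1]},
                   "label_ids": {"type": "int64"}}, "udc schema")
writer.write_raw_data(samples)  # samples: a list of dicts matching the schema
writer.commit()

# Read: use exactly the same file name; renaming breaks the data/index pairing.
data_set = ds.MindDataset(dataset_file="udc_train.mindrecord",
                          columns_list=["input_ids", "label_ids"])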
