Error when running ERNIE pretraining
Daniel_more, posted 2021-07, views: 4036, replies: 0

I am testing the pretraining section of this README: https://github.com/PaddlePaddle/ERNIE/blob/repro/README.zh.md#%E5%BC%80%E5%A7%8B%E8%AE%AD%E7%BB%83

Specifically, this part:

======================

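The launch script for the pretraining task is script/zh_task/pretrain.sh. Before starting pretraining, add the paths of the CUDA, cuDNN, NCCL2 and other dynamic libraries to the LD_LIBRARY_PATH environment variable; then run sh script/zh_task/pretrain.sh to start pretraining with the demo data and the default parameter configuration.

While the pretraining task runs, it prints the current learning rate, the number of epochs the training data has gone through, the total number of iteration steps so far, the training loss, the training speed, and similar information. Depending on the --validation_steps ${N} setting, the model's metrics on the validation set are printed every N steps: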

current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 30, loss: 10.540648, ppl: 19106.925781, next_sent_acc: 0.625000, speed: 0.849662 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
feed_queue size 70
current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 40, loss: 10.529287, ppl: 18056.654297, next_sent_acc: 0.531250, speed: 0.849549 steps/s, file: ./data/demo_train_set.gz, mask_type: mask_word
feed_queue size 70
current learning_rate:0.000001
epoch: 1, progress: 1/1, step: 50, loss: 10.360563, ppl: 16398.287109, next_sent_acc: 0.625000, speed: 0.843776 steps/s, file: ./data/demo_train_

====================
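
For reference, the progress lines in the README excerpt above can be pulled apart with a few lines of Python. This is only a sketch that assumes the comma-separated "key: value" layout shown in that excerpt; the name parse_progress_line is made up for illustration and is not part of ERNIE.

```python
import re

# Matches "key: value" pairs separated by commas, as in the quoted log lines.
PAIR_RE = re.compile(r"(\w+): ([^,]+)")


def parse_progress_line(line):
    """Extract the numeric training metrics from one progress line."""
    fields = dict(PAIR_RE.findall(line))
    return {
        "step": int(fields["step"]),
        "loss": float(fields["loss"]),
        "ppl": float(fields["ppl"]),
        "next_sent_acc": float(fields["next_sent_acc"]),
        # The speed field looks like "0.849662 steps/s"; keep only the number.
        "steps_per_sec": float(fields["speed"].split()[0]),
    }


# Sample line copied from the README excerpt.
sample = (
    "epoch: 1, progress: 1/1, step: 30, loss: 10.540648, ppl: 19106.925781, "
    "next_sent_acc: 0.625000, speed: 0.849662 steps/s, "
    "file: ./data/demo_train_set.gz, mask_type: mask_word"
)
print(parse_progress_line(sample))
```

Lines such as "current learning_rate:0.000001" and "feed_queue size 70" do not follow this layout and would need to be filtered out before calling the function.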

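According to the official documentation, this should print some training information, but on my machine it fails with an error.

The error output is as follows: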

+ export FLAGS_eager_delete_tensor_gb=0
+ export FLAGS_sync_nccl_allreduce=1
+ export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
+ hostname -i
+ python ./ernie/pretrain_launch.py --nproc_per_node 8 --selected_gpus 0,1,2,3,4,5,6,7 --node_ips 172.29.132.188 --node_id 0
2021-07-23 18:21:30,837-INFO: ----------- Configuration Arguments -----------
[INFO] 2021-07-23 18:21:30,837 [ args.py: 68]: ----------- Configuration Arguments -----------
2021-07-23 18:21:30,838-INFO: current_node_ip: None
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: current_node_ip: None
2021-07-23 18:21:30,838-INFO: log_prefix:
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: log_prefix:
2021-07-23 18:21:30,838-INFO: node_id: 0
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: node_id: 0
2021-07-23 18:21:30,838-INFO: node_ips: 172.29.132.188
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: node_ips: 172.29.132.188
2021-07-23 18:21:30,838-INFO: nproc_per_node: 8
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: nproc_per_node: 8
2021-07-23 18:21:30,838-INFO: print_config: True
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: print_config: True
2021-07-23 18:21:30,838-INFO: selected_gpus: 0,1,2,3,4,5,6,7
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: selected_gpus: 0,1,2,3,4,5,6,7
2021-07-23 18:21:30,838-INFO: split_log_path: ./log
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: split_log_path: ./log
2021-07-23 18:21:30,838-INFO: training_script:
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: training_script:
2021-07-23 18:21:30,838-INFO: training_script_args: []
[INFO] 2021-07-23 18:21:30,838 [ args.py: 70]: training_script_args: []
2021-07-23 18:21:30,838-INFO: ------------------------------------------------
[INFO] 2021-07-23 18:21:30,838 [ args.py: 71]: ------------------------------------------------
2021-07-23 18:21:30,838-INFO: 172.29.132.188
[INFO] 2021-07-23 18:21:30,838 [pretrain_launch.py: 75]: 172.29.132.188
2021-07-23 18:21:30,838-INFO: 1
[INFO] 2021-07-23 18:21:30,838 [pretrain_launch.py: 95]: 1
2021-07-23 18:21:30,838-INFO: all_trainer_endpoints: 172.29.132.188:6170,172.29.132.188:6171,172.29.132.188:6172,172.29.132.188:6173,172.29.132.188:6174,172.29.132.188:6175,172.29.132.188:6176,172.29.132.188:6177, node_id: 0, current_ip: 172.29.132.188, num_nodes: 1, node_ips: ['172.29.132.188'], gpus_per_proc: 1, selected_gpus_per_proc: [['0'], ['1'], ['2'], ['3'], ['4'], ['5'], ['6'], ['7']], nranks: 8
[INFO] 2021-07-23 18:21:30,838 [pretrain_launch.py: 114]: all_trainer_endpoints: 172.29.132.188:6170,172.29.132.188:6171,172.29.132.188:6172,172.29.132.188:6173,172.29.132.188:6174,172.29.132.188:6175,172.29.132.188:6176,172.29.132.188:6177, node_id: 0, current_ip: 172.29.132.188, num_nodes: 1, node_ips: ['172.29.132.188'], gpus_per_proc: 1, selected_gpus_per_proc: [['0'], ['1'], ['2'], ['3'], ['4'], ['5'], ['6'], ['7']], nranks: 8
2021-07-23 18:21:30,852-INFO: subprocess launched, check log at ./log/job.log.0
[INFO] 2021-07-23 18:21:30,852 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.0
2021-07-23 18:21:30,857-INFO: subprocess launched, check log at ./log/job.log.1
[INFO] 2021-07-23 18:21:30,857 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.1
2021-07-23 18:21:30,866-INFO: subprocess launched, check log at ./log/job.log.2
[INFO] 2021-07-23 18:21:30,866 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.2
2021-07-23 18:21:30,875-INFO: subprocess launched, check log at ./log/job.log.3
[INFO] 2021-07-23 18:21:30,875 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.3
2021-07-23 18:21:30,882-INFO: subprocess launched, check log at ./log/job.log.4
[INFO] 2021-07-23 18:21:30,882 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.4
2021-07-23 18:21:30,894-INFO: subprocess launched, check log at ./log/job.log.5
[INFO] 2021-07-23 18:21:30,894 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.5
2021-07-23 18:21:30,901-INFO: subprocess launched, check log at ./log/job.log.6
[INFO] 2021-07-23 18:21:30,901 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.6
2021-07-23 18:21:30,908-INFO: subprocess launched, check log at ./log/job.log.7
[INFO] 2021-07-23 18:21:30,908 [pretrain_launch.py: 151]: subprocess launched, check log at ./log/job.log.7
Traceback (most recent call last):
  File "./ernie/pretrain_launch.py", line 189, in <module>
    main(lanch_args)
  File "./ernie/pretrain_launch.py", line 177, in main
    start_procs(args)
  File "./ernie/pretrain_launch.py", line 165, in start_procs
    cmd=cmds[i])
subprocess.CalledProcessError: Command '['/opt/conda/envs/python35-paddle120-env/bin/python', '-u', ' ', '--is_distributed', 'true']' returned non-zero exit status 2.

 

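I have not modified the source files. pretrain_launch.py looks fine to me, although I am honestly not sure whether it really is. I don't know how to fix this error, so any help would be appreciated.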
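
One thing that stands out in the log: the configuration dump shows training_script as empty and training_script_args as [], and the command in the CalledProcessError has a single blank ' ' in the position where a script path would normally go. That suggests the launcher was started without a training script, although this is only a reading of the log and not a confirmed cause. The small check below makes the blank argument visible; blank_arguments is a hypothetical helper written for this post, not an ERNIE or Paddle function.

```python
def blank_arguments(cmd):
    """Return the indices of arguments that are empty or whitespace-only."""
    return [i for i, arg in enumerate(cmd) if not arg.strip()]


# Command copied from the CalledProcessError message in the log.
failed_cmd = [
    "/opt/conda/envs/python35-paddle120-env/bin/python",
    "-u",
    " ",
    "--is_distributed",
    "true",
]

# Index 2 is where the Python interpreter expects the script path.
print(blank_arguments(failed_cmd))  # prints [2]
```

If that reading is right, the per-process logs under ./log/job.log.0 through ./log/job.log.7 should contain the interpreter's own error message for the missing script.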
