Training and Prediction: Text Generation

Last updated: 2022-12-17

ERNIE 3.0 Zeus Training and Prediction

Environment Setup

cd wenxin_appzoo/wenxin_appzoo/models_hub
# Download ERNIE 3.0 Zeus 1.5B
sh download_ernie_3.0_zeus_1.5b_ch.sh
# Download ERNIE 3.0 Zeus 10B
sh download_ernie_3.0_zeus_10b_ch.sh

ERNIE 3.0 Zeus Training

Set the environment variables

source slurm/env.sh

Start model training

# 1.5B model
python3 -m paddle.distributed.fleet.launch run_trainer.py --param_path ./examples/ernie3.0_1.5b_zeus.json

# 10B model
python3 -m paddle.distributed.fleet.launch run_trainer.py --param_path ./examples/ernie3.0_10b_zeus.json

ERNIE 3.0 Zeus Prediction

Environment Setup

A prebuilt CUDA 10.2 environment is available:

wget http://bj.bcebos.com/wenxin-models/infer_env_glm.tar
tar xf infer_env_glm.tar

After the download completes, set the environment variables

source slurm/env.sh

Single-GPU Prediction

Accelerating with operator fusion

Operator fusion

bash tools/fuse_only.sh

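Note:

Before operator fusion, the optimizer-related parameters must be deleted from the ckpt; otherwise fusion fails with an error. To delete them, run the following command in the directory that contains the ckpt:

rm *moment*;rm *beta[12]*

The number of GPUs specified via the CUDA_VISIBLE_DEVICES environment variable (or wherever else the GPUs are specified) must match the number of model partitions; otherwise an error is raised.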
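The cleanup step in the note above can also be done from Python, which makes it possible to preview the files before deleting anything. This is a minimal sketch, not part of the toolkit: it reuses the two glob patterns from the `rm` command, and the function names are hypothetical.

```python
import fnmatch
from pathlib import Path

# Same glob patterns as the shell command: rm *moment*; rm *beta[12]*
OPTIMIZER_PATTERNS = ("*moment*", "*beta[12]*")


def find_optimizer_files(ckpt_dir):
    """Return the files directly inside ckpt_dir whose names match the patterns."""
    matches = []
    for path in sorted(Path(ckpt_dir).iterdir()):
        # Like the shell glob, skip hidden files and anything that is not a file.
        if not path.is_file() or path.name.startswith("."):
            continue
        if any(fnmatch.fnmatchcase(path.name, p) for p in OPTIMIZER_PATTERNS):
            matches.append(path)
    return matches


def remove_optimizer_files(ckpt_dir, dry_run=True):
    """List the matching files; delete them only when dry_run is False."""
    targets = find_optimizer_files(ckpt_dir)
    if not dry_run:
        for path in targets:
            path.unlink()
    return [p.name for p in targets]
```

Calling `remove_optimizer_files(ckpt_dir)` with the default `dry_run=True` only reports what would be removed; pass `dry_run=False` once the list looks right.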

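Forward-pass workflow

This mode is intended mainly for debugging and cannot be used to fine-tune the model.
Run the forward pass with the following commands:

# 1.5B model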
python3 run_trainer.py --param_path ./examples/ernie3.0_1.5b_zeus.json

# 10B model
python3 run_trainer.py --param_path ./examples/ernie3.0_10b_zeus.json

Infer-mode workflow

Save the inference model:

# 1.5B model
python3 run_trainer.py --param_path ./examples/ernie3.0_1.5b_zeus_save_infer_from_ckpt.json

# 10B model
python3 run_trainer.py --param_path ./examples/ernie3.0_10b_zeus_save_infer_from_ckpt.json

Run the following commands to start prediction:

# 1.5B model
python3 run_infer.py --param_path ./examples/ernie3.0_1.5b_zeus_infer.json

# 10B model
python3 run_infer.py --param_path ./examples/ernie3.0_10b_zeus_infer.json
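The single-GPU Infer-mode workflow is two commands run in order: save the inference model, then predict. The sketch below chains them so that a failure in the first step stops the second. The config paths are copied from the commands above; the function names and the `CONFIGS` table are hypothetical helpers, and the script is assumed to run from the same directory as the commands.

```python
import subprocess

# (save-infer-model config, prediction config) per model size, as used above.
CONFIGS = {
    "1.5b": (
        "./examples/ernie3.0_1.5b_zeus_save_infer_from_ckpt.json",
        "./examples/ernie3.0_1.5b_zeus_infer.json",
    ),
    "10b": (
        "./examples/ernie3.0_10b_zeus_save_infer_from_ckpt.json",
        "./examples/ernie3.0_10b_zeus_infer.json",
    ),
}


def build_infer_commands(size):
    """Return the two commands of the single-GPU Infer-mode workflow, in order."""
    save_cfg, infer_cfg = CONFIGS[size]
    return [
        ["python3", "run_trainer.py", "--param_path", save_cfg],
        ["python3", "run_infer.py", "--param_path", infer_cfg],
    ]


def run_infer_pipeline(size):
    """Run both steps; stop at the first command that exits with an error."""
    for cmd in build_infer_commands(size):
        # check=True raises CalledProcessError on a non-zero exit status.
        subprocess.run(cmd, check=True)
```

For example, `run_infer_pipeline("1.5b")` saves the inference model for the 1.5B checkpoint and then starts prediction.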

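Multi-GPU Prediction

Partitioning parameters for multi-GPU MP runs

Run the partitioning script to convert the single-GPU model into a multi-GPU model. The example uses pp=1, mp=4: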

sh tools/mp_rearange.sh

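Note:

Before partitioning the model, the optimizer-related parameters must be deleted from the ckpt; otherwise operator fusion fails with an error. To delete them, run the following command in the directory that contains the ckpt:

rm *moment*;rm *beta[12]*

The number of GPUs specified via the CUDA_VISIBLE_DEVICES environment variable (or wherever else the GPUs are specified) must match the number of model partitions; otherwise an error is raised.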
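The GPU-count requirement in the note above can be checked before launching. The sketch below assumes the number of model partitions equals `num_pp * num_mp`, which is consistent with the `num_mp * num_pp == self.nranks` assertion quoted in the FAQ; the helper names are hypothetical.

```python
import os


def visible_gpu_count(env=None):
    """Count the devices listed in CUDA_VISIBLE_DEVICES; None if it is unset."""
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return len([d for d in value.split(",") if d.strip()])


def check_gpu_count(num_pp, num_mp, env=None):
    """Raise ValueError if the visible GPU count differs from num_pp * num_mp."""
    expected = num_pp * num_mp
    actual = visible_gpu_count(env)
    # When the variable is unset, the count cannot be checked from here.
    if actual is not None and actual != expected:
        raise ValueError(
            f"CUDA_VISIBLE_DEVICES lists {actual} GPU(s), "
            f"but num_pp * num_mp = {expected}"
        )
    return expected, actual
```

With pp=1 and mp=4, `CUDA_VISIBLE_DEVICES=0,1,2,3` passes the check, while `CUDA_VISIBLE_DEVICES=0,1` raises an error.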

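Forward-pass workflow

This mode is intended mainly for debugging and cannot be used to fine-tune the model.
Edit the model configuration under examples to match the partitioning used above; the pp and mp values in this step must be the same as those used when partitioning the model. For example, if the model was partitioned as save_model_pp1mp4, set: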

{
    "num_pp": 1,
    "num_mp": 4
}

Then run the forward pass with the following commands:

# 1.5B model
python3 -m paddle.distributed.fleet.launch run_trainer.py --param_path ./examples/ernie3.0_1.5b_zeus_dist.json

# 10B model
python3 -m paddle.distributed.fleet.launch run_trainer.py --param_path ./examples/ernie3.0_10b_zeus_dist.json
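The `num_pp` / `num_mp` edit described in this section can be scripted when several config files need the same change. The text does not say where these keys sit inside the JSON files, so this sketch makes no assumption about nesting: it overwrites every existing `num_pp` and `num_mp` key it finds and adds none. The function names are hypothetical.

```python
import json


def set_parallel_degrees(config, num_pp, num_mp):
    """Overwrite existing "num_pp"/"num_mp" keys anywhere in a nested config.

    Returns the number of values that were overwritten.
    """
    updates = {"num_pp": num_pp, "num_mp": num_mp}
    changed = 0
    stack = [config]
    while stack:
        node = stack.pop()
        if isinstance(node, dict):
            for key, value in list(node.items()):
                if key in updates:
                    node[key] = updates[key]
                    changed += 1
                else:
                    stack.append(value)
        elif isinstance(node, list):
            stack.extend(node)
    return changed


def update_config_file(path, num_pp, num_mp):
    """Apply set_parallel_degrees to a JSON file in place."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    changed = set_parallel_degrees(config, num_pp, num_mp)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, ensure_ascii=False, indent=4)
    return changed
```

A return value of 0 means the file contained neither key, which is a hint that the wrong config file was passed.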

Infer-mode workflow

Save the inference model:

# 1.5B model
python3 -m paddle.distributed.fleet.launch run_trainer.py --param_path ./examples/ernie3.0_1.5b_zeus_dist_save_infer_from_ckpt.json

# 10B model
python3 -m paddle.distributed.fleet.launch run_trainer.py --param_path ./examples/ernie3.0_10b_zeus_dist_save_infer_from_ckpt.json

Run the following commands to start prediction:

# 1.5B model
python3 -m paddle.distributed.fleet.launch run_infer_dist.py --param_path ./examples/ernie3.0_1.5b_zeus_dist_infer.json

# 10B model
python3 -m paddle.distributed.fleet.launch run_infer_dist.py --param_path ./examples/ernie3.0_10b_zeus_dist_infer.json

Other settings

In the configuration file:

use_fp16="False" # whether to enable fp16-precision inference
fuse="False" # whether to use fused ops; before enabling, first use sh tools/mp_rearange.sh

FAQ

Single-GPU forward pass or prediction fails because the number of PADDLE trainers does not match:

assert num_mp * num_pp == self.nranks
AssertionError

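This may be caused by PADDLE FLEET environment variables that are set by default on the system. In that case, change commands such as
python3 run_trainer.py ...
to:
python3 -m paddle.distributed.fleet.launch --gpus 0 run_trainer.py ...
where the argument after --gpus is replaced with the index of the GPU(s) actually in use.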
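To see whether such variables are present in the current shell, the environment can be inspected before launching. This is a diagnostic sketch only. It assumes the relevant variables share the `PADDLE_` prefix (for example `PADDLE_TRAINERS_NUM`), which the text above does not confirm, and the function name is hypothetical.

```python
import os


def paddle_env_vars(env=None):
    """Return the environment variables whose names start with PADDLE_."""
    env = os.environ if env is None else env
    # Assumption: the fleet-related variables use the PADDLE_ prefix.
    return {k: v for k, v in sorted(env.items()) if k.startswith("PADDLE_")}


if __name__ == "__main__":
    found = paddle_env_vars()
    if not found:
        print("No PADDLE_* variables are set in this environment.")
    for name, value in found.items():
        print(f"{name}={value}")
```

If the output lists a trainer count that differs from `num_mp * num_pp`, launching through `paddle.distributed.fleet.launch --gpus ...` as described above is the fix the text recommends.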

Errors during the forward pass or prediction:

FatalError: `Process abort signal` is detected by the operating system.
[TimeInfo: *** Aborted at 1665304472 (unix time) try "date -d @1665304472" if you are using GNU date ***]
[SignalInfo: *** SIGABRT (@0x4916) received by PID 18710 (TID 0x7fc9335fe700) from PID 18710 ***]

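Check whether the NVIDIA driver version is compatible with the paddle and CUDA versions in use; the error may be caused by an incompatibility. For example, CUDA 11.1 requires driver version >= 460.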
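The driver check can be automated. The sketch below reads the installed driver version from `nvidia-smi` and compares its major number with a minimum. The only threshold in the table is the one stated above (CUDA 11.1, driver >= 460); entries for other CUDA versions would have to be taken from NVIDIA's compatibility documentation. The function names are hypothetical.

```python
import subprocess

# Minimum driver major version per CUDA version.
# Only the 11.1 entry comes from the text above; add others as needed.
MIN_DRIVER_MAJOR = {"11.1": 460}


def driver_major(version):
    """Extract the major number from a driver version string such as 460.32.03."""
    return int(version.strip().split(".")[0])


def driver_is_new_enough(driver_version, cuda_version):
    """True if the driver's major version meets the minimum for cuda_version."""
    return driver_major(driver_version) >= MIN_DRIVER_MAJOR[cuda_version]


def query_driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.splitlines()[0].strip()
```

On a machine with a GPU, `driver_is_new_enough(query_driver_version(), "11.1")` reports whether the installed driver meets the requirement quoted above.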

ERNIE-Gen Training and Prediction

Environment Setup

# Download the model
cd wenxin_appzoo/wenxin_appzoo/models_hub
sh download_ernie_gen_base_ch.sh
# Task directory
cd wenxin_appzoo/wenxin_appzoo/tasks/text_generation/ernie_gen

ERNIE-Gen Training

python run_trainer_ernie_gen.py --param_path cls_ernie_gen_infilling_ch.json
* The trained model is saved under the ./output/ernie_gen_base_ch folder

ERNIE-Gen Prediction

python run_infer.py --param_path examples/cls_ernie_gen_infilling_ch_infer.json
