Hands-on: Model Inference with Wenxin

Last updated: 2022-08-01

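This section uses the model trained in the previous section, "Hands-on: Training a Model with Wenxin", as an example to show how to run prediction (inference) with Wenxin.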

Environment Setup

See: Environment Installation and Configuration

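Preparing the Data

As described in the previous section, this is a classic NLP binary text classification task. Each prediction input is therefore one line of plain text, and the model's prediction is a probability distribution over the two classes. For non-ERNIE tasks, users must segment the text into words themselves beforehand. Sample data: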
```
USB接口 只有 2个 ， 太 少 了 点 ， 不能 接 太多 外 接 设备 ！ 表面 容易 留下 污垢 ！
平时 只 用来 工作 ， 上 上网 ， 挺不错 的 ， 没有 冗余 的 功能 ， 样子 也 比较 正式 ！
还 可以 吧 ， 价格 实惠   宾馆 反馈   2008年4月17日   ：   谢谢 ！ 欢迎 再次 入住 其士 大酒店 。
```
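ERNIE tasks do not require word segmentation. Apart from that, the data format is the same as for non-ERNIE tasks, so it is not repeated here.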
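For non-ERNIE tasks, the space-separated format shown above can be produced with a few lines of Python. The following is only a minimal sketch: it assumes the tokens come from whatever word segmenter you already use, and the helper name `write_segmented_file` is hypothetical and not part of Wenxin.

```python
from pathlib import Path


def write_segmented_file(samples, out_path):
    """Write one sample per line, with tokens separated by a single space.

    `samples` is a list of token lists produced by your own segmenter.
    Returns the number of lines written.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    lines = []
    for tokens in samples:
        # Drop empty or whitespace-only tokens so that splitting on " "
        # later yields exactly the tokens that were passed in.
        cleaned = [t.strip() for t in tokens if t.strip()]
        if cleaned:
            lines.append(" ".join(cleaned))
    out_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

For example, `write_segmented_file([["价格", "实惠"]], "./data/predict_data/infer.txt")` would write a single line; the file name `infer.txt` is an assumption made for illustration.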
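The vocabulary file must be the same file that was used during model training. It has two columns: the first is the token and the second is its id (starting from 0), with the columns separated by \t. A Wenxin vocabulary must contain the five tokens [PAD], [CLS], [SEP], [MASK], and [UNK]; if you supply your own vocabulary, make sure all five are present. A partial vocabulary example: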
```
[PAD]    0
[CLS]    1
[SEP]    2
[MASK]    3
[UNK]    4
郑重    5
天空    6
工地    7
神圣    8
```
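The vocabulary rules above (two tab-separated columns, ids starting from 0, five required tokens) can be checked before running prediction. The sketch below is one possible way to do so; `check_vocab` is a hypothetical helper, not a Wenxin API.

```python
REQUIRED_TOKENS = ("[PAD]", "[CLS]", "[SEP]", "[MASK]", "[UNK]")


def check_vocab(path):
    """Return a list of problems found in a two-column, tab-separated vocab file."""
    problems = []
    vocab = {}
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.rstrip("\r\n")
            if not line:
                continue
            parts = line.split("\t")
            # Each line must be exactly "<token>\t<id>" with a decimal id.
            if len(parts) != 2 or not parts[1].strip().isdecimal():
                problems.append(f"line {line_no}: expected '<token>\\t<id>'")
                continue
            vocab[parts[0]] = int(parts[1])
    for token in REQUIRED_TOKENS:
        if token not in vocab:
            problems.append(f"missing required token {token}")
    if vocab and min(vocab.values()) != 0:
        problems.append("ids should start from 0")
    return problems
```

An empty list from `check_vocab("./dict/vocab.txt")` means none of these checks failed; it does not verify that the file matches the one used for training.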
Place the prepared prediction data under ./data/predict_data/ and the corresponding vocabulary file under ./dict/.

JSON Configuration: Configuring the Model via JSON

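The trained model is stored in ./output/cls_bow_ch/save_inference_model/. In that directory, locate the saved inference_model, for example inference_step_251.
Enter that model path in the inference_model_path field of ./examples/cls_bow_ch_infer.json. Taking cls_bow_ch_infer.json as an example, the modified parts look like this: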

{
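"dataset_reader": {    ## The data section matches the data configuration in the previous section and is not repeated here; the difference is that a prediction task only needs predict_reader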
  "predict_reader": {
    "name": "predict_reader",
    "type": "BasicDataSetReader",
    "fields": [
      {
        "name": "text_a",
        "data_type": "string",
        "reader": {
          "type": "CustomTextFieldReader"
        },
        "tokenizer": {
          "type": "CustomTokenizer",
          "split_char": " ",
          "unk_token": "[UNK]",
          "params": null
        },
        "need_convert": true,
        "vocab_path": "./dict/vocab.txt",
        "max_seq_len": 512,
        "truncation_type": 0,
        "padding_id": 0,
        "embedding": null
      }
    ],
    "config": {
      "data_path": "./data/predict_data",
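      "shuffle": false, ## Note: this must be turned off; shuffled output is hard to compare against the input data
      "batch_size": 8,
      "epoch": 1,    ## Note: epoch must be set to 1; repeating the prediction is pointless.
      "need_data_distribute": false,
      "need_generate_examples": true,  ## Setting this to true returns the plain-text samples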
      "key_tag": false
    }
  }
},
"inference": {
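  "output_path": "./output/predict_result.txt",  ## Output path for prediction results; if omitted, it defaults to "./output/predict_result.txt"
  "PADDLE_PLACE_TYPE": "cpu",
  "num_labels": 2,  ## Required: the number of classes of the classification model, used when parsing prediction results
  "inference_model_path":   "./output/cls_bow_ch/save_inference_model/inference_step_251",  ## Path to the model to run prediction with
  "extra_param": {  ## Same as in trainer: extra parameters beyond the core required information, e.g. meta information that can serve as keywords for log statistics.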
    "meta":{
      "job_type": "text_classification"
    }
  }
}
}
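Filling in inference_model_path by hand is easy to get wrong when several checkpoints exist. The sketch below shows one way to pick the saved model with the highest step number and write it into the config. It rests on two assumptions: saved models are named `inference_step_<N>` as in the example above, and the config file on disk is plain JSON (the `##` annotations shown here are explanatory and would not parse). Both helper names are hypothetical.

```python
import json
import re
from pathlib import Path

# Assumed naming scheme, taken from the example "inference_step_251".
STEP_PATTERN = re.compile(r"^inference_step_(\d+)")


def latest_inference_model(save_dir):
    """Return the entry under save_dir with the highest step number, or None."""
    best = None
    for entry in Path(save_dir).iterdir():
        match = STEP_PATTERN.match(entry.name)
        if match:
            step = int(match.group(1))
            if best is None or step > best[0]:
                best = (step, entry)
    return None if best is None else best[1]


def set_inference_model_path(config_path, model_path):
    """Write model_path into the config's inference.inference_model_path field."""
    config_path = Path(config_path)
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["inference"]["inference_model_path"] = Path(model_path).as_posix()
    config_path.write_text(
        json.dumps(config, ensure_ascii=False, indent=2), encoding="utf-8"
    )
```

A typical call would be `set_inference_model_path("./examples/cls_bow_ch_infer.json", latest_inference_model("./output/cls_bow_ch/save_inference_model"))`, after checking that the second argument is not `None`.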

Running Prediction

Run run_infer.py and pass it the corresponding parameter configuration file, as shown below:
```
python run_infer.py --param_path ./examples/cls_bow_ch_infer.json
```
Logs produced during prediction are saved automatically to ./output/predict_result.txt. Part of the prediction output is shown below:

2020-02-24 18:46:54,634-INFO: start do predict....
INFO: 02-24 18:46:54: inference.py:59 * 139699868489472 start do predict....
2020-02-24 18:46:54,637-INFO: 0
INFO: 02-24 18:46:54: basic_dataset_reader.py:70 * 139699868489472 0
2020-02-24 18:46:54,669-INFO: [0.48963895 0.51036108]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.48963895 0.51036108]  ## The two probabilities are the model's confidence that the text belongs to label=0 and label=1, respectively
2020-02-24 18:46:54,670-INFO: [0.49629667 0.50370336]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49629667 0.50370336]
2020-02-24 18:46:54,671-INFO: [0.49249911 0.50750083]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49249911 0.50750083]
2020-02-24 18:46:54,672-INFO: [0.49087134 0.50912869]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49087134 0.50912869]
2020-02-24 18:46:54,673-INFO: [0.49540547 0.50459456]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49540547 0.50459456]
2020-02-24 18:46:54,673-INFO: [0.49168214 0.50831795]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49168214 0.50831795]
2020-02-24 18:46:54,674-INFO: [0.49629667 0.50370336]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49629667 0.50370336]
2020-02-24 18:46:54,675-INFO: [0.49258387 0.50741607]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49258387 0.50741607]
2020-02-24 18:46:54,677-INFO: 0
INFO: 02-24 18:46:54: basic_dataset_reader.py:70 * 139699868489472 0
2020-02-24 18:46:54,682-INFO: [0.48791078 0.51208919]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.48791078 0.51208919]
2020-02-24 18:46:54,683-INFO: [0.4841378  0.51586217]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.4841378  0.51586217]
2020-02-24 18:46:54,684-INFO: [0.4887262 0.5112738]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.4887262 0.5112738]
2020-02-24 18:46:54,685-INFO: [0.49037534 0.50962466]
INFO: 02-24 18:46:54: bow_classification.py:128 * 139699868489472 [0.49037534 0.50962466]
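To turn output like the above into labels, the probability pairs can be parsed and the larger one selected. The sketch below assumes the log format is exactly as shown in this sample. Each prediction appears twice, once per log format, so only the timestamp-prefixed lines are matched to avoid double counting. `parse_predictions` is a hypothetical helper.

```python
import re

# Matches lines such as "2020-02-24 18:46:54,669-INFO: [0.48963895 0.51036108]".
PROB_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d+-INFO: \[([^\]]+)\]\s*$"
)


def parse_predictions(lines):
    """Return a list of (label, probabilities) tuples, one per prediction."""
    results = []
    for line in lines:
        match = PROB_LINE.match(line.strip())
        if not match:
            continue
        probs = [float(x) for x in match.group(1).split()]
        # The predicted label is the index of the largest probability.
        label = max(range(len(probs)), key=probs.__getitem__)
        results.append((label, probs))
    return results
```

Because shuffle is false, the i-th parsed tuple corresponds to the i-th line of the input data.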

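For a model trained with ERNIE, prediction works the same way as for the BOW model described above. The only difference is that the data section of the JSON must use the field_reader and tokenizer that correspond to ERNIE. Taking ./examples/cls_ernie_fc_ch_infer.json as an example: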

{
"dataset_reader": {
  "predict_reader": {
    "name": "predict_reader",
    "type": "BasicDataSetReader",
    "fields": [
      {
        "name": "text_a",
        "data_type": "string",
        "reader": {
          "type": "ErnieTextFieldReader"   ## Field reader dedicated to ERNIE tasks.
        },
        "tokenizer": {
          "type": "FullTokenizer",  ## Uses FullTokenizer to split text character by character; dedicated to ERNIE tasks.
          "split_char": " ",
          "unk_token": "[UNK]"
        },
        "need_convert": true,
        "vocab_path": "../model_files/dict/vocab_ernie_2.0_base_ch.txt",  ## Vocabulary file for the ERNIE model; must match the vocabulary used during training.
        "max_seq_len": 512,
        "truncation_type": 0,
        "padding_id": 0,
        "embedding": null
      }
    ],
    "config": {
      "data_path": "./data/predict_data",
      "shuffle": false,
      "batch_size": 8,
      "epoch": 1,
      "sampling_rate": 1.0,
      "need_data_distribute": false,
      "need_generate_examples": true
    }
  }
},
"inference": {
  "output_path": "./output/predict_result.txt",
  "inference_model_path": "./output/cls_ernie_fc_ch/save_inference_model/inference_step_251_enc", ## Path to the model to run prediction with, trained with an ERNIE-based network.
  "PADDLE_PLACE_TYPE": "cpu",
  "thread_num": 2,   ## Number of threads
  "num_labels": 2,
  "extra_param": {
    "meta":{
      "job_type": "text_classification"
    }
  }
}
}
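Both configurations share a few settings that the annotations call out as mandatory for prediction: shuffle off, epoch set to 1, and num_labels present. The sketch below checks these on a parsed config. It assumes the file on disk is plain JSON without the `##` annotations, and the function names are hypothetical rather than part of Wenxin.

```python
import json


def check_infer_config(config):
    """Return a list of warnings for an already-parsed inference config dict."""
    warnings = []
    reader_cfg = config["dataset_reader"]["predict_reader"]["config"]
    inference = config["inference"]
    if reader_cfg.get("shuffle"):
        warnings.append("shuffle should be false so outputs line up with inputs")
    if reader_cfg.get("epoch") != 1:
        warnings.append("epoch should be 1 for prediction")
    if "num_labels" not in inference:
        warnings.append("inference.num_labels is required")
    if not inference.get("inference_model_path"):
        warnings.append("inference.inference_model_path is not set")
    return warnings


def check_infer_config_file(path):
    """Load a JSON config file and run the checks above on it."""
    with open(path, encoding="utf-8") as f:
        return check_infer_config(json.load(f))
```

Running `check_infer_config_file("./examples/cls_ernie_fc_ch_infer.json")` before `run_infer.py` would surface these mistakes early; an empty list means none of the checks fired.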

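Other

The steps above describe batch prediction run locally in Python through the offline run_infer.py script. If you need to set up an HttpServer for API-based prediction, or want to use the C++ prediction deployment service, please contact us separately.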
