(New) Advanced Task: Training Data Distribution Correction

Last updated: 2022-12-16

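Task Overview

Because of limitations in how a dataset is collected and in the experience of its annotators, a training dataset often has a biased distribution. A model can exploit such bias as a shortcut for prediction. In sentiment analysis, for example, it may predict "negative" as soon as it sees a negation word or phrase. A model that relies on these shortcuts has not learned to truly understand and reason: it performs very well on test data that follows the training distribution but poorly on test data that does not, which means its generalization and robustness are poor.

For this reason, the Wenxin kit provides a bias identification method based on dataset statistics, together with an automated implementation of data distribution correction. Bias identification based on dataset statistics counts how words are distributed over the annotated labels in the training data and uses those counts to identify biased words and biased samples. Data distribution correction repeatedly samples the non-biased data so that the training distribution becomes as balanced as possible.

Concretely, the approach uses the trustworthy analysis feature analysis method to identify the evidence in the training data that contributes most to the model's predictions, then analyzes the distribution of labels and evidence in the training data to identify biased samples, and finally repeatedly samples from the biased samples to balance the data.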
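
The statistical idea described above (count how each word is distributed over the labels, then flag words that almost always appear with one label) can be sketched as follows. This is an illustrative toy, not the Wenxin implementation: the function name find_bias_words, the min_count and threshold parameters, and the English sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def find_bias_words(samples, min_count=3, threshold=0.9):
    """Return {word: (dominant_label, ratio)} for words with a skewed label distribution.

    samples: iterable of (tokens, label) pairs.
    min_count and threshold are hypothetical knobs chosen for this sketch.
    """
    word_label_counts = defaultdict(Counter)
    for tokens, label in samples:
        for word in set(tokens):  # count each word at most once per sample
            word_label_counts[word][label] += 1

    bias_words = {}
    for word, counts in word_label_counts.items():
        total = sum(counts.values())
        if total < min_count:
            continue  # too rare to say anything about its distribution
        label, top = counts.most_common(1)[0]
        ratio = top / total
        if ratio >= threshold:
            bias_words[word] = (label, ratio)
    return bias_words


# Toy data: "not" only ever appears with label 0, so it is flagged as a bias word.
samples = [
    (["not", "good"], 0),
    (["not", "clean"], 0),
    (["not", "worth", "it"], 0),
    (["good", "value"], 1),
    (["good", "service"], 1),
]
print(find_bias_words(samples))
```

In this toy set "good" appears with both labels, so it is not flagged even though it occurs as often as "not".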

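Quick Start

Code Structure

This task lives in the /wenxin_appzoo/tasks/text_classification directory and is an advanced use of the classification task. The directory structure is as follows: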

.
├── data
│   ├── analysis_data              ## Demo data for trustworthy analysis (example analysis, feature analysis)
│   │   ├── example_analysis       ## Example analysis
│   │   │   └── demo.txt
│   │   └── feature_analysis       ## Feature analysis
│   │       └── demo.txt
......
│   ├── dict
│   │   ├── sentencepiece.bpe.model
│   │   └── vocab.txt
......
│   ├── predict_data
│   │   └── infer.txt
│   ├── test_data
│   │   └── test.txt
│   ├── train_data
│   │   └── train.txt
......
├── data_set_reader
│   ......
├── ernie_doc_infer_server.py
├── ernie_run_infer_server.py
├── examples
│   ├── cls_ernie_fc_ch_data_distribution_correct.json   ## Config for the task that computes the dataset's sample distribution
│   ├── cls_ernie_fc_ch_example_analysis.json   ## Config for the example analysis task
│   ├── cls_ernie_fc_ch_feature_analysis.json   ## Config for the feature analysis task
│   ├── cls_ernie_fc_ch_find_dirty_data.json    ## Config for the dirty data filtering task
│   ├── cls_ernie_fc_ch_infer.json
│   ├── cls_ernie_fc_ch_infer_with_active_learning.json
│   ├── cls_ernie_fc_ch_infer_with_iflytek.json
│   ├── cls_ernie_fc_ch.json
│   ......
├── find_dirty_data.py              ## Script that finds dirty data based on example analysis results
├── inference
│   ......
├── model
│   ├── base_cls.py
│   ├── ernie_classification.py
│   ......
├── reader
│   ......
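├── run_balance_data.py                  ## Script that balances the training data distribution
├── run_data_distribution_correct.py     ## Script that computes the dataset's sample distribution
├── run_example_analysis.py              ## Launch script for example analysis
├── run_features_analysis.py             ## Launch script for feature analysis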
├── ......
├── trainer
│   ......
└── trust_analysis
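    ├── gradient_similarity_wenxin.py    ## Example analysis script based on gradient similarity
    ├── integrated_gradients_wenxin.py   ## Feature analysis script based on integrated gradients
    └── respresenter_point_wenxin.py     ## Example analysis script based on the representer point method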

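Preparation

Before analyzing data bias, you first need to train a model that can make predictions. The Wenxin kit currently provides data bias analysis and correction only for text classification, so you first need to train a model on a classification task. ERNIE 3.0 Base is used as the example here.

Model Preparation

All pretrained models are stored under wenxin_appzoo/wenxin_appzoo/models_hub. Enter that folder and run sh download_ernie_3.0_base_ch.sh to download the model parameters, vocabulary, and network configuration file for ERNIE 3.0 Base.

Training Preparation

For data preparation, refer to: V2.1.0 Preparation.

The training configuration file (examples/cls_ernie_fc_ch.json) is as follows: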

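{
  "dataset_reader": {
    "train_reader": {
      "name": "train_reader",  ## Training, validation, and test sets may come from different datasets with different formats, so a separate reader can be configured for each in the JSON; this is the reader for the training set.
      "type": "BasicDataSetReader",  ## Uses BasicDataSetReader, which wraps common operations such as reading tsv/txt files and building batches.
      "fields": [  ## A field is a high-level Wenxin abstraction. When a sample has several fields, each field has its own data type (text, numeric, integer, float), its own vocabulary, and so on, and semantic representation such as text-to-id conversion is done per field; a field_reader is the class that implements these operations.
        {
          "name": "text_a",     ## The first feature field of the text classification task, named "text_a".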
          "data_type": "string",
          "reader": {
            "type": "ErnieTextFieldReader"
          },
          "tokenizer": {
            "type": "FullTokenizer",   ## Tokenizer for text_a; FullTokenizer is used for almost all models except ernie-tiny.
            "split_char": " ",
            "unk_token": "[UNK]"
          },
          "need_convert": true,
          "vocab_path": "../../models_hub/ernie_3.0_base_ch_dir/vocab.txt",  ## Path to the vocabulary file.
          "max_seq_len": 512,
          "truncation_type": 0,
          "padding_id": 0
        },
        {
          "name": "label",
          "data_type": "int",
          "reader": {
            "type": "ScalarFieldReader"
          },
          "tokenizer": null,
          "need_convert": false,
          "vocab_path": "",
          "max_seq_len": 1,
          "truncation_type": 0,
          "padding_id": 0,
          "embedding": null
        }
      ],
      "config": {
        "data_path": "./data/train_data",    ## Data path.
        "shuffle": true,
        "batch_size": 8,
        "epoch": 10,
        "sampling_rate": 1.0,
        "need_data_distribute": true,
        "need_generate_examples": false
      }
    },
    "test_reader": {    ## This is the reader for the test set.
      "name": "test_reader",
      "type": "BasicDataSetReader",
      "fields": [
        {
          "name": "text_a",
          "data_type": "string",
          "reader": {
            "type": "ErnieTextFieldReader"
          },
          "tokenizer": {
            "type": "FullTokenizer",
            "split_char": " ",
            "unk_token": "[UNK]"
          },
          "need_convert": true,
          "vocab_path": "../../models_hub/ernie_3.0_base_ch_dir/vocab.txt",
          "max_seq_len": 512,
          "truncation_type": 0,
          "padding_id": 0
        },
        {
          "name": "label",
          "data_type": "int",
          "need_convert": false,
          "reader": {
            "type": "ScalarFieldReader"
          },
          "tokenizer": null,
          "vocab_path": "",
          "max_seq_len": 1,
          "truncation_type": 0,
          "padding_id": 0,
          "embedding": null
        }
      ],
      "config": {
        "data_path": "./data/test_data",
        "shuffle": false,
        "batch_size": 8,
        "epoch": 1,
        "sampling_rate": 1.0,
        "need_data_distribute": false,
        "need_generate_examples": false
      }
    }
  },
  "model": {
    "type": "ErnieClassification",
    "is_dygraph": 1,
    "optimization": {   ## Optimizer settings; these are the defaults recommended for Wenxin ERNIE.
      "learning_rate": 2e-05,
      "use_lr_decay": true,
      "warmup_steps": 0,
      "warmup_proportion": 0.1,
      "weight_decay": 0.01,
      "use_dynamic_loss_scaling": false,
      "init_loss_scaling": 128,
      "incr_every_n_steps": 100,
      "decr_every_n_nan_or_inf": 2,
      "incr_ratio": 2.0,
      "decr_ratio": 0.8
    },
    "embedding": {
      "config_path": "../../models_hub/ernie_3.0_base_ch_dir/ernie_config.json"
    },
    "num_labels": 2
  },
  "trainer": {
    "type": "CustomDynamicTrainer",
    "PADDLE_PLACE_TYPE": "gpu",
    "PADDLE_IS_FLEET": 0,    ## Whether to launch with fleetrun. Multi-GPU runs must use fleetrun; single-GPU runs can be launched either with fleetrun or directly with python.
    "train_log_step": 10,
    "use_amp": true,
    "is_eval_dev": 0,
    "is_eval_test": 1,
    "eval_step": 100,
    "save_model_step": 200,
    "load_parameters": "",
    "load_checkpoint": "",
    "pre_train_model": [
      {
        "name": "ernie_3.0_base_ch",
        "params_path": "../../models_hub/ernie_3.0_base_ch_dir/params"
      }
    ],
    "output_path": "./output/cls_ernie_3.0_base_fc_ch_dy",
    "extra_param": {
      "meta":{
        "job_type": "text_classification"
      }

    }
  }
}

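Start Training

# Enter the task directory
cd wenxin_appzoo/wenxin_appzoo/tasks/text_classification
# Single-GPU training: if the fleetrun switch (PADDLE_IS_FLEET) is set to 0, use the following command
python ./run_trainer.py --param_path "./examples/cls_ernie_fc_ch.json"
# Single-GPU training: if the fleetrun switch (PADDLE_IS_FLEET) is set to 1, use the following command
fleetrun --log_dir log ./run_trainer.py --param_path "./examples/cls_ernie_fc_ch.json" 1>log/lanch.log 2>&1

# Multi-GPU training: the fleetrun switch (PADDLE_IS_FLEET) must be set to 1; use the following command
fleetrun --log_dir log ./run_trainer.py --param_path "./examples/cls_ernie_fc_ch.json" 1>log/lanch.log 2>&1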

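The scripts above start training using the JSON configuration file.
Training logs are written to the log folder; workerlog.N holds the log of the N-th GPU. If the program fails, check the workerlog.N files of the individual GPUs to find the relevant error message.
The trained model is saved under ./output/cls_ernie_3.0_base_fc_ch_dy. From the saved model files, the checkpoints are used in the next step.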
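
To pick a checkpoint for the next step, a small helper like the one below may be convenient. It is only a sketch: it assumes checkpoints are stored as save_checkpoints/checkpoints_step_<N> directories under the output path, which is inferred from the example load_parameters path used later in this document, and the function name latest_checkpoint is hypothetical.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoints_step_<N> directory with the largest N, or None.

    Assumes the layout <output_dir>/save_checkpoints/checkpoints_step_<N>/.
    """
    ckpt_root = Path(output_dir) / "save_checkpoints"
    if not ckpt_root.is_dir():
        return None
    best_step, best_path = -1, None
    for child in ckpt_root.iterdir():
        match = re.fullmatch(r"checkpoints_step_(\d+)", child.name)
        # Compare step numbers numerically so that step 1000 beats step 400.
        if child.is_dir() and match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), child
    return best_path


# Prints the newest checkpoint directory, or None if nothing has been saved yet.
print(latest_checkpoint("./output/cls_ernie_3.0_base_fc_ch_dy"))
```

The returned path can then be copied into the load_parameters entry of the analysis configuration.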

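Start Data Distribution Correction

Data distribution correction has two main steps. First, the trustworthy learning feature analysis method (see Advanced Task: Model Interpretability - Feature Analysis) identifies the evidence (tokens) in the training dataset that contributes most to the model's predictions, along with the frequency of each. Then the biased samples are analyzed using this evidence and its frequencies, and the under-represented classes among the biased samples are resampled to balance the data. A demo of the data is shown below: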

选择珠江花园的原因就是方便，有电动扶梯直接到达海边，周围餐馆、食廊、商场、超市、摊位一应俱全。酒店装修一般，但还算整洁。         1 
15.4寸笔记本的键盘确实爽，基本跟台式机差不多了，蛮喜欢数字小键盘，输数字特方便，样子也很美观，做工也相当不错         1 
房间太小。其他的都一般。。。。。。。。。         0
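
Before running the analysis, it can help to check how the labels are distributed in the training file. The sketch below parses lines in the demo format shown above. It assumes that the text and the label are separated by a tab, which the demo rendering does not make explicit; parse_line and label_distribution are hypothetical helper names.

```python
from collections import Counter


def parse_line(line):
    """Split one 'text<TAB>label' line into (text, int_label)."""
    text, label = line.strip().rsplit("\t", 1)
    return text.strip(), int(label)


def label_distribution(lines):
    """Count how many samples carry each label, skipping blank lines."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue
        _, label = parse_line(line)
        counts[label] += 1
    return counts


# Shortened versions of the demo samples, with an assumed tab separator.
demo_lines = [
    "酒店装修一般，但还算整洁。\t1 \n",
    "样子也很美观，做工也相当不错\t1 \n",
    "房间太小。其他的都一般。\t0\n",
]
print(label_distribution(demo_lines))
```

For a real file, pass an open file object, for example label_distribution(open("./data/train_data/train.txt", encoding="utf-8")).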

Run feature analysis on the training set:

The configuration file (examples/cls_ernie_fc_ch_data_distribution_correct.json) is as follows:

{
    "dataset_reader": {
      "train_reader": {
      "name": "train_reader",    ## Reader configuration for the training set
      "type": "BasicDataSetReader",
      "fields": [
        {
          "name": "text_a",
          "data_type": "string",
          "reader": {
            "type": "ErnieTextFieldReader"
          },
          "tokenizer": {
            "type": "FullTokenizer",
            "split_char": " ",
            "unk_token": "[UNK]"
          },
          "need_convert": true,
          "vocab_path": "../../models_hub/ernie_3.0_base_ch_dir/vocab.txt",
          "max_seq_len": 512,
          "truncation_type": 0,
          "padding_id": 0
        },
        {
          "name": "label",
          "data_type": "int",
          "reader": {
            "type": "ScalarFieldReader"
          },
          "tokenizer": null,
          "need_convert": false,
          "vocab_path": "",
          "max_seq_len": 1,
          "truncation_type": 0,
          "padding_id": 0,
          "embedding": null
        }
      ],
      "config": {
        "data_path": "./data/train_data",
        "shuffle": false,    ## Must be false: the data must not be shuffled
        "batch_size": 1,     ## Must be 1; otherwise padding introduces noise
        "epoch": 1,          ## The number of epochs must be 1
        "sampling_rate": 1.0,
        "need_data_distribute": false,
        "need_generate_examples": true
      }
    }
    },
    "model": {
      "type": "ErnieClassification",
      "is_dygraph": 1,
      "optimization": {     ## Keep consistent with the training configuration
        "learning_rate": 2e-05,
        "use_lr_decay": false,    ## Must be set to false here
        "warmup_steps": 0,
        "warmup_proportion": 0.1,
        "weight_decay": 0.01,
        "use_dynamic_loss_scaling": false,
        "init_loss_scaling": 128,
        "incr_every_n_steps": 100,
        "decr_every_n_nan_or_inf": 2,
        "incr_ratio": 2.0,
        "decr_ratio": 0.8
      },
      "embedding": {
        "config_path": "../../models_hub/ernie_3.0_base_ch_dir/ernie_config.json"
      },
      "num_labels": 2
    },
    "trainer": {
      "type": "CustomDynamicTrainer",
      "PADDLE_PLACE_TYPE": "gpu",
      "PADDLE_IS_FLEET": 0,
      "train_log_step": 10,
      "use_amp": true,
      "is_eval_dev": 0,
      "is_eval_test": 0,
      "eval_step": 1,
      "save_model_step": 200,
      "load_parameters": "./output/cls_ernie_3.0_base_fc_ch_dy/save_checkpoints/checkpoints_step_501/",   ## Checkpoint directory of the model trained in the previous step
      "load_checkpoint": "",
      "pre_train_model": [],
      "output_path": "./output/analysis_result.txt",   ## Where the result is saved: the important contributing features that the model identifies in the training data, with their frequencies.
      "extra_param": {
        "meta":{
          "job_type": "text_classification"
        }
      }
    }
  }

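Run it:

# Use run_data_distribution_correct.py in the current directory to analyze data bias.
python run_data_distribution_correct.py --param_path=./examples/cls_ernie_fc_ch_data_distribution_correct.json
# When the run finishes, the result is saved to ./output/analysis_result.txt.

The analysis result looks like the following. The first column is the important feature (token) and the second column is its frequency.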

不错    59
酒店    25
服务    14
没有    13
价格    12
可以    10
好      10
系统    9
麻烦    8
性价比  8
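
The result file can be inspected with a few lines of code. The sketch below reads the two-column format shown above into a dictionary and optionally keeps only tokens above a frequency cutoff. It assumes the columns are separated by whitespace; load_rationales and min_freq are hypothetical names introduced for this example.

```python
def load_rationales(lines, min_freq=1):
    """Parse 'token  frequency' lines into {token: frequency}.

    Lines that do not have exactly two whitespace-separated columns are skipped.
    """
    rationales = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue
        token, freq = parts[0], int(parts[1])
        if freq >= min_freq:
            rationales[token] = freq
    return rationales


# A few lines copied from the sample output above.
result_lines = ["不错    59", "好      10", "麻烦    8", ""]
print(load_rationales(result_lines, min_freq=10))
```

For the real file, pass open("./output/analysis_result.txt", encoding="utf-8") instead of the in-memory list.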

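Resample the imbalanced samples:

Resample the imbalanced samples according to the important features collected in the previous step.

# Use run_balance_data.py in the working directory to resample the imbalanced samples. The script has three required parameters:
# train_path: path to the training data to balance; it must point to a single file.
# rationale_path: path to the file of important features and their frequencies produced in the previous step.
# output_path: path where the balanced training data is saved.
python run_balance_data.py --train_path=./data/train_data/train.txt --rationale_path=./output/analysis_result.txt --output=./output/output_train.txt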
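
The internals of run_balance_data.py are not described here, so the following is only one plausible reading of "resampling the under-represented classes among the biased samples", not the script's actual logic. For each important token, it groups the samples containing that token by label and duplicates randomly chosen samples from the smaller groups until every group matches the largest one. The function name balance_by_rationale, the substring match, and the seed parameter are all assumptions.

```python
import random
from collections import defaultdict


def balance_by_rationale(samples, rationales, seed=0):
    """Oversample minority-label samples for each rationale token.

    samples: list of (text, label); rationales: iterable of tokens.
    Returns a new list: the original samples plus the duplicated ones.
    """
    rng = random.Random(seed)
    balanced = list(samples)
    for token in rationales:
        by_label = defaultdict(list)
        for text, label in samples:
            if token in text:  # simple substring match, an assumption of this sketch
                by_label[label].append((text, label))
        if len(by_label) < 2:
            continue  # token occurs with a single label: nothing to resample from
        target = max(len(group) for group in by_label.values())
        for group in by_label.values():
            # Draw with replacement until this label reaches the majority count.
            balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


toy = [("不错 a", 1), ("不错 b", 1), ("不错 c", 1), ("不错 d", 0)]
balanced = balance_by_rationale(toy, ["不错"])
print(len(balanced))
```

In the toy data the token appears three times with label 1 and once with label 0, so two extra copies of the label 0 sample are added.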

(New) Advanced Task: Dirty Data Identification