&emsp; This tutorial opens the **`example` series of the `fastNLP v0.8 tutorial`**. Each tutorial

&emsp; in the series presents a worked application of `fastNLP v0.8` to a natural language processing task.

# E1. SST-2 classification with BERT + fine-tuning

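&emsp; 1 &ensp; Background: the `GLUE` General Language Understanding Evaluation benchmark and the `SST-2` binary sentiment classification dataset

&emsp; 2 &ensp; Preparation: loading the `tokenizer`, preprocessing the `dataset`, using the `dataloader`

&emsp; 3 &ensp; Model training: loading `distilbert-base`, `fastNLP` parameter matching, `fine-tuning`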

In [1]:
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, Dataset

import transformers
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification

import sys
sys.path.append('..')

import fastNLP
from fastNLP import Trainer
from fastNLP import Accuracy

print(transformers.__version__)

4.18.0


### 1. Background: the GLUE benchmark and the SST-2 binary sentiment classification dataset

&emsp; This example fine-tunes a `distilbert-base` classification model on the `SST-2` dataset

&emsp; &emsp; from the `GLUE` benchmark. We first give a brief introduction to `GLUE` and `SST-2`.

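**`GLUE`**, short for **`General Language Understanding Evaluation`**,

&emsp; comprises 9 datasets, all in English, covering several natural language understanding (`NLU`) tasks:

&emsp; &emsp; **`CoLA`**: single-sentence classification, predicting whether a sentence is grammatically acceptable; **`SST-2`**: single-sentence classification, predicting binary sentiment

&emsp; &emsp; **`MRPC`**: sentence-pair classification, predicting whether two sentences are semantically equivalent; **`STS-B`**: similarity scoring, predicting a sentence-pair similarity score (regression)

&emsp; &emsp; **`QQP`**: sentence-pair classification, predicting whether two questions are semantically equivalent; **`MNLI`**: textual inference, predicting entailment / contradiction / neutral for a sentence pair

&emsp; &emsp; **`QNLI`/`RTE`/`WNLI`**: textual inference, binary prediction of whether entailment holds (`QNLI` is converted from `SQuAD`)

&emsp; Classic models such as `BERT` and `T5` are evaluated on this benchmark; see the [GLUE paper](https://arxiv.org/pdf/1804.07461v3.pdf) for details.

&emsp; &emsp; Here we use `SST-2` to train a `bert`-style model for text classification; the other tasks are described in the figure below.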

In [2]:
GLUE_TASKS = ['cola', 'mnli', 'mrpc', 'qnli', 'qqp', 'rte', 'sst2', 'stsb', 'wnli']

task = 'sst2'

<img src="./figures/E1-fig-glue-benchmark.png" width="70%" height="70%" align="center"></img>

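**`SST`**, short for **`Stanford Sentiment Treebank`**, is a **single-sentence sentiment classification** dataset.

&emsp; It contains sentences from movie reviews with their sentiment polarity: 1 for `positive`, 0 for `negative`.

&emsp; The dataset has three splits: 67,349 training, 872 validation, and 1,821 test examples; see the [download link](https://gluebenchmark.com/tasks) for more.

In code, we call the `load_dataset` function from the `datasets` module, specifying the `SST-2` dataset, which is then loaded automatically.

&emsp; After the first download, the data is cached under `~/.cache/huggingface/datasets/glue/`.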

In [3]:
from datasets import load_dataset

dataset = load_dataset('glue', task)

Reusing dataset glue (/remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad)


  0%|          | 0/3 [00:00<?, ?it/s]

&emsp; After loading, print a sample according to the `SST-2` format in `GLUE` to check the result.

In [4]:
task_to_keys = {
    'cola': ('sentence', None),
    'mnli': ('premise', 'hypothesis'),
    'mrpc': ('sentence1', 'sentence2'),
    'qnli': ('question', 'sentence'),
    'qqp': ('question1', 'question2'),
    'rte': ('sentence1', 'sentence2'),
    'sst2': ('sentence', None),
    'stsb': ('sentence1', 'sentence2'),
    'wnli': ('sentence1', 'sentence2'),
}

sentence1_key, sentence2_key = task_to_keys[task]

if sentence2_key is None:
    print(f"Sentence: {dataset['train'][0][sentence1_key]}")
else:
    print(f"Sentence 1: {dataset['train'][0][sentence1_key]}")
    print(f"Sentence 2: {dataset['train'][0][sentence2_key]}")

Sentence: hide new secretions from the parental units 


### 2. Preparation: loading the tokenizer, preprocessing the dataset, using the dataloader

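&emsp; Next comes the preparation for training: use the `tokenizer` to tokenize and encode the dataset,

&emsp; &emsp; then define `SeqClsDataset` and the corresponding `dataloader` to feed the data during training and evaluation.

Both the `tokenizer` and the `SequenceClassificationModel` here are based on the **`distilbert-base-uncased` model**,

&emsp; a smaller model trained on lowercased (uncased) text and **obtained by knowledge distillation from `bert-base`**. Structurally,

&emsp; it has **1 embedding layer** and **6 self-attention (Transformer) layers**, with **about `66M` parameters**; its structure is listed at the end of this tutorial, and the [DistilBert paper](https://arxiv.org/pdf/1910.01108.pdf) has more details.

First, import the **`AutoTokenizer` module** from the `transformers` library and **initialize it with `from_pretrained`**.

&emsp; The `use_fast` argument selects whether to use the fast version of the `tokenizer`. We then encode a sample input to check the result.

&emsp; Note the two keys in the returned value: **`'input_ids'`** is the sequence of token ids for the original text,

&emsp; &emsp; and **`'attention_mask'`** is the mask used in self-attention (positions marked `0` correspond to `padding`).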

In [5]:
model_checkpoint = 'distilbert-base-uncased'

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=True)

print(tokenizer("Hello, this one sentence!", "And this sentence goes with it."))

{'input_ids': [101, 7592, 1010, 2023, 2028, 6251, 999, 102, 1998, 2023, 6251, 3632, 2007, 2009, 1012, 102], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
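The encoding above follows the BERT sentence-pair layout `[CLS] A [SEP] B [SEP]`; in this vocabulary, id `101` is `[CLS]` and `102` is `[SEP]`. As a minimal sketch (the helper `split_segments` is hypothetical and not part of `transformers`), the ids printed above can be split back into the token ids of the two sentences:

```python
# Token ids copied from the tokenizer output above.
input_ids = [101, 7592, 1010, 2023, 2028, 6251, 999, 102,
             1998, 2023, 6251, 3632, 2007, 2009, 1012, 102]

CLS_ID, SEP_ID = 101, 102  # special-token ids in the uncased BERT vocabulary


def split_segments(ids):
    """Split a [CLS] A [SEP] B [SEP] sequence into its content segments."""
    if ids[0] != CLS_ID:
        raise ValueError("sequence does not start with [CLS]")
    segments, current = [], []
    for token_id in ids[1:]:
        if token_id == SEP_ID:
            # A [SEP] closes the current segment.
            segments.append(current)
            current = []
        else:
            current.append(token_id)
    return segments


first, second = split_segments(input_ids)
print(len(first), len(second))
```

The all-ones `attention_mask` in the output is expected: a single encoded input has no padding, so every position is a real token.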


Next, define a preprocessing function and **use the `dataset.map` method** to **encode the text of every example** as **token id sequences**.

In [6]:
def preprocess_function(examples):
    if sentence2_key is None:
        return tokenizer(examples[sentence1_key], truncation=True)
    return tokenizer(examples[sentence1_key], examples[sentence2_key], truncation=True)

encoded_dataset = dataset.map(preprocess_function, batched=True)

Loading cached processed dataset at /remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-ca1fbe5e8eb059f3.arrow
Loading cached processed dataset at /remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-03661263fbf302f5.arrow
Loading cached processed dataset at /remote-home/xrliu/.cache/huggingface/datasets/glue/sst2/1.0.0/dacbe3125aa31d7f70367a07a8a9e72a5a0bfeb5fc42e75c9db75b96da6053ad/cache-fbe8e7a4e4f18f45.arrow
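With `batched=True`, the preprocessing function receives a dict of columns (each value is a list covering several examples) and returns a dict of new columns, which are added alongside the existing ones. The following is a toy stand-in that only illustrates this contract; `toy_map` and `toy_preprocess` are hypothetical names, and this is not how the `datasets` library is implemented:

```python
def toy_map(columns, fn, batch_size=2):
    """Apply fn to column-wise batches and merge the returned columns back in."""
    n_rows = len(next(iter(columns.values())))
    new_columns = {}
    for start in range(0, n_rows, batch_size):
        # Slice every column to build one batch, as a dict of lists.
        batch = {key: values[start:start + batch_size]
                 for key, values in columns.items()}
        for key, values in fn(batch).items():
            new_columns.setdefault(key, []).extend(values)
    # Original columns are kept; columns returned by fn are added.
    return {**columns, **new_columns}


def toy_preprocess(examples):
    # Whitespace split stands in for a real tokenizer; it only counts tokens.
    return {'n_tokens': [len(s.split()) for s in examples['sentence']]}


data = {'sentence': ['a fine film', 'dull', 'not bad at all'],
        'label': [1, 0, 1]}
encoded = toy_map(data, toy_preprocess)
print(sorted(encoded.keys()), encoded['n_tokens'])
```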


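Then, subclass `Dataset` from `torch` to define the `SeqClsDataset` class. Note that

&emsp; **the keys read in `__getitem__`** **must match the fields of the underlying dataset**.

&emsp; For example, `'label'` comes from the original `SST-2` dataset (whose fields include `'sentence'` and `'label'`), while

&emsp; &emsp; `'input_ids'` and `'attention_mask'` are the fields added by the `tokenizer`.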

In [7]:
class SeqClsDataset(Dataset):
    def __init__(self, dataset):
        Dataset.__init__(self)
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, item):
        item = self.dataset[item]
        return item['input_ids'], item['attention_mask'], [item['label']] 

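After that, **define the collate function `collate_fn` to pad the examples in a `batch` to the same length**. Note that its

&emsp; **return value must be a dict** whose **keys correspond to the parameters of the model's `train_step` and `evaluate_step` functions**.

&emsp; This is the first **parameter matching** mechanism of `fastNLP v0.8`, already emphasized in `tutorial-0`.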

In [8]:
def collate_fn(batch):
    # Gather each field and track its maximum length within the batch.
    input_ids, atten_mask, labels = [], [], []
    max_length = [0] * 3
    for each_item in batch:
        input_ids.append(each_item[0])
        max_length[0] = max(max_length[0], len(each_item[0]))
        atten_mask.append(each_item[1])
        max_length[1] = max(max_length[1], len(each_item[1]))
        labels.append(each_item[2])
        max_length[2] = max(max_length[2], len(each_item[2]))

    # Pad every sequence in place with 0 up to the longest one of its field.
    for i in range(3):
        each = (input_ids, atten_mask, labels)[i]
        for item in each:
            item.extend([0] * (max_length[i] - len(item)))
    # The keys must match the parameter names of train_step / evaluate_step.
    return {'input_ids': torch.cat([torch.tensor([item]) for item in input_ids], dim=0),
            'attention_mask': torch.cat([torch.tensor([item]) for item in atten_mask], dim=0),
            'labels': torch.cat([torch.tensor(item) for item in labels], dim=0)}
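The padding step inside `collate_fn` can be looked at in isolation. The following is a minimal pure-Python sketch, assuming a hypothetical helper `pad_batch`; unlike `collate_fn`, it returns new lists instead of extending the inputs in place, and it leaves out the tensor conversion:

```python
def pad_batch(sequences, pad_value=0):
    """Right-pad every sequence with pad_value to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_value] * (max_len - len(seq)) for seq in sequences]


# Two toy examples of different length (ids are illustrative).
input_ids = [[101, 2023, 2143, 102], [101, 102]]
attention_mask = [[1] * len(ids) for ids in input_ids]

padded_ids = pad_batch(input_ids)
padded_mask = pad_batch(attention_mask)
print(padded_ids)
print(padded_mask)
```

Padding the mask with `0` is what makes the padded positions invisible to self-attention, which matches the meaning of `'attention_mask'` described earlier.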

Finally, wrap the `tokenizer`-processed training and validation sets in `SeqClsDataset` and split them into batches with `DataLoader`.

In [9]:
dataset_train = SeqClsDataset(encoded_dataset['train'])
dataloader_train = DataLoader(dataset=dataset_train, 
                              batch_size=32, shuffle=True, collate_fn=collate_fn)
dataset_valid = SeqClsDataset(encoded_dataset['validation'])
dataloader_valid = DataLoader(dataset=dataset_valid, 
                              batch_size=32, shuffle=False, collate_fn=collate_fn)

### 3. Model training: loading distilbert-base, fastNLP parameter matching, fine-tuning

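&emsp; The last step is model training: build the classification model on `distilbert-base-uncased`,

&emsp; &emsp; initialize the `optimizer` and the `trainer`, and run training with the `run` function.

The model is built as an `nn.Module`. As with the `tokenizer`, we import

&emsp; `AutoModelForSequenceClassification` from the `transformers` library and initialize it from the `distilbert-base-uncased` checkpoint.

Pay attention to **the input arguments and output structure of `AutoModelForSequenceClassification`**.

&emsp; On the one hand, **when labels are passed in as `labels`**, **the module computes the `loss` with its built-in loss function**,

&emsp; &emsp; and the input can be given either as token ids `input_ids` or as token embeddings `inputs_embeds`.

&emsp; On the other hand, the module does not output predictions directly; it **outputs `logits`, the unnormalized scores for each class**.

&emsp; &emsp; The `train_step` and `evaluate_step` functions are defined accordingly.

&emsp; &emsp; Note also that their return values reflect the second **parameter matching** mechanism of `fastNLP v0.8`.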

In [10]:
class SeqClsModel(nn.Module):
    def __init__(self, num_labels, model_checkpoint):
        nn.Module.__init__(self)
        self.num_labels = num_labels
        self.back_bone = AutoModelForSequenceClassification.from_pretrained(model_checkpoint, 
                                                                            num_labels=num_labels)

    def forward(self, input_ids, attention_mask, labels=None):
        output = self.back_bone(input_ids=input_ids, 
                                attention_mask=attention_mask, labels=labels)
        return output

    def train_step(self, input_ids, attention_mask, labels):
        loss = self(input_ids, attention_mask, labels).loss
        return {'loss': loss}

    def evaluate_step(self, input_ids, attention_mask, labels):
        pred = self(input_ids, attention_mask, labels).logits
        pred = torch.max(pred, dim=-1)[1]
        return {'pred': pred, 'target': labels}
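The idea behind parameter matching, where the keys of the batch dict are matched by name against the parameters of `train_step` or `evaluate_step`, can be illustrated with the standard `inspect` module. This is only a sketch of the idea under that assumption, not fastNLP's actual implementation; `call_with_matched_args` and the dummy `train_step` are hypothetical:

```python
import inspect


def call_with_matched_args(fn, batch):
    """Call fn with the batch entries whose keys match fn's parameter names."""
    params = inspect.signature(fn).parameters
    # Required parameters that the batch cannot supply are reported up front.
    missing = [name for name, param in params.items()
               if param.default is inspect.Parameter.empty and name not in batch]
    if missing:
        raise KeyError(f"batch has no entry for parameters: {missing}")
    # Extra keys in the batch are ignored.
    return fn(**{name: batch[name] for name in params if name in batch})


def train_step(input_ids, attention_mask, labels):
    # Dummy stand-in for a model step: reports the batch size as the "loss".
    return {'loss': float(len(input_ids))}


batch = {'input_ids': [[101, 102]], 'attention_mask': [[1, 1]],
         'labels': [1], 'idx': [0]}
result = call_with_matched_args(train_step, batch)
print(result)
```

Under this reading, a `collate_fn` that returned a key such as `'label'` instead of `'labels'` would fail at call time, which is why the key names have to be kept in sync with the step functions.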

Next, set the number of classes to instantiate the model, and initialize the optimizer with `torch.optim.AdamW`.

In [11]:
num_labels = 3 if task == 'mnli' else 1 if task == 'stsb' else 2

model = SeqClsModel(num_labels=num_labels, model_checkpoint=model_checkpoint)

optimizers = AdamW(params=model.parameters(), lr=5e-5)

Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.bias', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_layer_norm.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifi

Then define the `trainer` using the `dataloader_train` and `dataloader_valid` built earlier.

In [12]:
trainer = Trainer(
    model=model,
    driver='torch',
    device=0,  # 'cuda'
    n_epochs=10,
    optimizers=optimizers,
    train_dataloader=dataloader_train,
    evaluate_dataloaders=dataloader_valid,
    metrics={'acc': Accuracy()}
)

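Finally, train the model with `trainer.run`. The `n_epochs` argument above already sets `10` training epochs;

&emsp; the `num_eval_batch_per_dl` argument makes each evaluation during training use only `10` `batch`es of the validation set.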

In [13]:
trainer.run(num_eval_batch_per_dl=10)

Output()

Output()
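To get a feel for what `num_eval_batch_per_dl=10` means here, the arithmetic can be worked out from the batch size of `dataloader_valid` and the validation set size. This is a small sketch; the figure of 872 validation examples is taken from the `total#acc` value reported by the final evaluation, and it assumes the evaluated batches are all full:

```python
import math

batch_size = 32             # batch size used for dataloader_valid
n_valid = 872               # validation examples reported by the final evaluation
num_eval_batch_per_dl = 10  # value passed to trainer.run

# Number of batches needed to cover the whole validation set.
total_batches = math.ceil(n_valid / batch_size)
# Examples covered by each partial evaluation during training.
examples_seen = min(num_eval_batch_per_dl, total_batches) * batch_size
fraction = round(examples_seen / n_valid, 3)
print(total_batches, examples_seen, fraction)
```

So the accuracy shown during training is an estimate from roughly a third of the validation set, while the full set is only evaluated by the separate `trainer.evaluator.run()` call.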

In [14]:
trainer.evaluator.run()

Output()

{'acc#acc': 0.884174, 'total#acc': 872.0, 'correct#acc': 771.0}
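The reported accuracy is simply `correct / total`. The sketch below mirrors in plain Python what `evaluate_step` and the accuracy metric do together, namely taking the arg-max over the `logits` and comparing it with the targets; `argmax` and `accuracy` are hypothetical helpers and the toy logits are made up for illustration:

```python
def argmax(row):
    """Index of the largest value in a list of scores."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy(logits, targets):
    """Arg-max each row of logits and compare against the targets."""
    preds = [argmax(row) for row in logits]
    correct = sum(int(p == t) for p, t in zip(preds, targets))
    return {'acc': correct / len(targets), 'total': len(targets), 'correct': correct}


# Toy logits for four examples and two classes.
toy_logits = [[0.2, 1.5], [2.0, -1.0], [0.1, 0.3], [1.2, 0.4]]
toy_targets = [1, 0, 0, 0]
toy_result = accuracy(toy_logits, toy_targets)
print(toy_result)

# Sanity check on the numbers reported above: 771 correct out of 872.
reported = round(771 / 872, 6)
print(reported)
```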

#### Appendix: structure of the `DistilBertForSequenceClassification` module

```
<bound method DistilBertForSequenceClassification.forward of DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (1): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (2): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (3): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (4): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
        (5): TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
            (lin1): Linear(in_features=768, out_features=3072, bias=True)
            (lin2): Linear(in_features=3072, out_features=768, bias=True)
            (activation): GELUActivation()
          )
          (output_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
        )
      )
    )
  )
  (pre_classifier): Linear(in_features=768, out_features=768, bias=True)
  (classifier): Linear(in_features=768, out_features=2, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)>
```
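The printed structure is enough to check the `66M` parameter figure by hand. The sketch below counts weights and biases from the layer shapes listed above; the helper names are hypothetical, and it assumes every `Linear` has a bias and every `LayerNorm` has a weight and a bias, as the printout indicates:

```python
def linear(n_in, n_out):
    """Parameters of a Linear layer with bias."""
    return n_in * n_out + n_out


def layer_norm(dim):
    """Parameters of a LayerNorm with elementwise affine (weight and bias)."""
    return 2 * dim


hidden, ffn_dim = 768, 3072
vocab_size, max_positions = 30522, 512
n_blocks, n_labels = 6, 2

# Embeddings: word embeddings + position embeddings + LayerNorm.
embeddings = vocab_size * hidden + max_positions * hidden + layer_norm(hidden)
# One TransformerBlock: q/k/v/out projections, two LayerNorms, two FFN layers.
block = (4 * linear(hidden, hidden) + layer_norm(hidden)
         + linear(hidden, ffn_dim) + linear(ffn_dim, hidden) + layer_norm(hidden))
encoder = embeddings + n_blocks * block
# Classification head: pre_classifier + classifier.
head = linear(hidden, hidden) + linear(hidden, n_labels)
total = encoder + head
print(encoder, total)
```

The encoder alone accounts for about 66.4M parameters; the classification head adds roughly 0.6M on top.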