The U-Net medical model performs two-dimensional image segmentation. The implementation follows the paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). U-Net won several best awards in the 2015 ISBI cell tracking challenge. The paper proposes a network model and a data augmentation strategy for medical image segmentation that make effective use of annotated data, addressing the shortage of labeled data in the medical field. The U-shaped network structure is also used to extract context and localization information.
[论文](https://arxiv.org/abs/1505.04597): Olaf Ronneberger, Philipp Fischer, Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." *conditionally accepted at MICCAI 2015*. 2015.
- Description: The training and test datasets are two 30-section serial-section transmission electron microscopy (ssTEM) datasets of the Drosophila first instar larva ventral nerve cord (VNC). The microcube measures approximately 2 x 2 x 1.5 microns with a resolution of 4x4x50 nm/pixel.
- License: You are free to use this dataset to generate or test non-commercial image segmentation software. If a scientific publication uses this dataset, it must cite TrakEM2 and the following publication: Cardona A, Saalfeld S, Preibisch S, Schmid B, Cheng A, Pulokas J, Tomancak P, Hartenstein V. 2010. An Integrated Micro- and Macroarchitectural Analysis of the Drosophila Brain by Computer-Assisted Serial Section Electron Microscopy. PLoS Biol 8(10): e1000502. doi:10.1371/journal.pbio.1000502.
# MASS: Masked Sequence to Sequence Pre-training for Language Generation Description
[MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf) was released by Microsoft in June 2019.
BERT (Devlin et al., 2018) has achieved state-of-the-art results in natural language understanding by pre-training the encoder part of the Transformer (Vaswani et al., 2017) on masked rich-resource text. Likewise, GPT (Radford et al., 2018) pre-trains the decoder part of the Transformer on rich-resource text with the encoder inputs masked. Both build a robust language model by pre-training on masked rich-resource text.
Inspired by BERT, GPT and other language models, Microsoft proposed [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf), which jointly pre-trains the encoder and the decoder for language generation tasks.
# Model Architecture
The MASS network is implemented with the Transformer architecture, which has multiple encoder layers and multiple decoder layers.
For pre-training, we use the Adam optimizer with loss scaling to obtain the pre-trained model.
During fine-tuning, we fine-tune this pre-trained model on different datasets according to the downstream task.
During testing, we use the fine-tuned model to predict results, and adopt a beam search algorithm to obtain the most likely predictions.
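Since beam search is mentioned above, here is a minimal, framework-agnostic sketch of length-normalized beam search; the toy scorer, vocabulary and beam width are illustrative assumptions rather than the project's actual decoding code:

```python
# Minimal length-normalized beam search sketch (illustrative only; the real
# decoder scores tokens with the fine-tuned Transformer, not this toy function).
def toy_log_probs(prefix, vocab):
    """Return a fake log-probability for each candidate next token."""
    return {tok: -(len(prefix) + i + 1.0) for i, tok in enumerate(vocab)}

def beam_search(vocab, beam_size=4, max_len=5, eos="<eos>"):
    beams = [([], 0.0)]  # each hypothesis: (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:  # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for tok, lp in toy_log_probs(seq, vocab).items():
                candidates.append((seq + [tok], score + lp))
        # keep the best `beam_size` hypotheses, scored with length normalization
        candidates.sort(key=lambda c: c[1] / len(c[0]), reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]

print(beam_search(vocab=["the", "cat", "sat", "<eos>"]))
```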
# Dataset
Note that you can run the scripts with the datasets mentioned in the original paper or with other datasets widely used in this domain or for this network architecture. In the following sections, we introduce how to run the scripts using the datasets below.
Datasets used:

- Monolingual English data from the News Crawl dataset (WMT 2019) for pre-training.
- Gigaword Corpus (Graff et al., 2003) for text summarization.
- Cornell Movie Dialog Corpus (Danescu-Niculescu-Mizil & Lee, 2011) for conversational response generation.

Details about these datasets can be found in [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf).
# Features
MASS is designed to jointly pre-train the encoder and decoder to complete language generation tasks.
First, through its sequence to sequence framework, MASS only predicts the masked tokens, which forces the encoder to understand the meaning of the unmasked tokens and encourages the decoder to extract useful information from the encoder.
Second, by predicting consecutive tokens on the decoder side, the decoder builds better language modeling ability than when predicting discrete tokens.
Third, by further masking the decoder input tokens that are not masked on the encoder side, the decoder is encouraged to extract more useful information from the encoder side rather than relying on the rich information in the previous tokens.
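The following is a small, self-contained Python sketch of the masking scheme described above; the token list, mask symbol and span choice are illustrative assumptions, not the project's data pipeline:

```python
# Illustrative MASS-style masking (assumed token list and mask symbol).
MASK = "[MASK]"

def mass_mask(tokens, start, length):
    """Mask a contiguous fragment for the encoder and build decoder inputs/targets.

    Encoder input : fragment [start, start+length) replaced by MASK.
    Decoder target: the masked fragment itself.
    Decoder input : the fragment shifted right by one position, so tokens the
                    encoder still sees stay hidden from the decoder.
    """
    fragment = tokens[start:start + length]
    encoder_input = tokens[:start] + [MASK] * length + tokens[start + length:]
    decoder_input = [MASK] + fragment[:-1]   # shifted-right masked fragment
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target

tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc_in, dec_in, dec_tgt = mass_mask(tokens, start=2, length=3)
print(enc_in)   # ['x1', 'x2', '[MASK]', '[MASK]', '[MASK]', 'x6', 'x7', 'x8']
print(dec_in)   # ['[MASK]', 'x3', 'x4']
print(dec_tgt)  # ['x3', 'x4', 'x5']
```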
# Script description
The MASS scripts and code structure are as follows:
```
...
│ ├──bi_data_loader.py      // Dataset loader for fine-tuning or inference.
│ ├──mono_data_loader.py    // Dataset loader for pre-training.
│ ├──language_model
│ ...
│ ├──run_gpu.sh             // GPU train & evaluate model script.
│ ├──learn_subword.sh       // Learn BPE codes.
│ ├──stop_training.sh       // Stop training.
├── requirements.txt        // Requirements of third-party packages.
├── train.py                // Train API entry.
├── eval.py                 // Infer API entry.
├── tokenize_corpus.py      // Corpus tokenization.
...
```
## Data Preparation
The data preparation of a natural language processing task involves data cleaning, tokenization, encoding and vocabulary generation.
In our experiments, using [Byte Pair Encoding (BPE)](https://arxiv.org/abs/1508.07909) reduces the vocabulary size and effectively relieves the out-of-vocabulary (OOV) problem.
The vocabulary can be created with `src/utils/dictionary.py` from the text dictionary learned by BPE.
For more details about BPE, please refer to the [subword-nmt library](https://www.cnpython.com/pypi/subword-nmt) or the [paper](https://arxiv.org/abs/1508.07909).
In our experiments, the vocabulary was learned from 1.9M sentences of the News Crawl dataset, and its size is 45755.
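As a concrete illustration, the sketch below learns BPE codes and applies them using the subword-nmt Python API; the file names and merge count are placeholders, and the project's own `learn_subword.sh` and `src/utils/dictionary.py` remain the reference workflow:

```python
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merge operations from a tokenized corpus (paths are placeholders).
with codecs.open("train.tok.txt", encoding="utf-8") as infile, \
     codecs.open("all.bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000, min_frequency=2)

# Apply the learned codes to segment text into subword units.
with codecs.open("all.bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("researchers released a new pretraining method"))
```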
Here is a brief introduction to the data preparation scripts.
### Tokenization
Use `tokenize_corpus.py` to tokenize a corpus whose text files are in `.txt` format.
Major parameters in `tokenize_corpus.py`:
```bash
--corpus_folder: Corpus folder path; if multiple folders are provided, separate them with ','.
--output_folder: Output folder path.
--tokenizer: Tokenizer to be used, nltk or jieba; if nltk is not fully installed, use jieba instead.
```
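For reference, the snippet below shows what the two tokenizer back-ends do on raw text. This is a rough illustration only; the actual splitting and file handling are performed inside `tokenize_corpus.py`:

```python
# Rough illustration of the two tokenizer back-ends (not the script itself).
import nltk
import jieba

nltk.download("punkt", quiet=True)  # tokenizer models needed by word_tokenize

english = "MASS pre-trains a sequence to sequence model."
print(nltk.word_tokenize(english))   # ['MASS', 'pre-trains', 'a', ...]

chinese = "掩码序列到序列预训练"
print(list(jieba.cut(chinese)))      # coarse Chinese word segmentation
```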
## Configuration

The JSON file under the `config/` path is the template configuration file.
Almost all of the required options and arguments can be assigned conveniently, including the training platform, the dataset and model configurations, the optimizer arguments, etc. Optional features such as loss scaling and checkpointing are also available by setting the corresponding options.

For more detailed information about the attributes, refer to the file `config/config.py`.
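As a quick sanity check, the template can be loaded and inspected with the standard `json` module. The key names below (`dataset_config`, `checkpoint_path`, and their children) follow the nodes referenced later in this document and are assumptions about the template layout:

```python
import json

# Load the template configuration and inspect a few nodes (assumed layout).
with open("config/config.json", encoding="utf-8") as f:
    config = json.load(f)

print(config.get("dataset_config", {}).get("pre_train_dataset"))
print(config.get("checkpoint_path", {}).get("existed_ckpt"))
```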
## Training & Evaluation process
For training a model, the shell script `run_ascend.sh` or `run_gpu.sh` is all you need. In these scripts, the environment variables are set and the training script `train.py` under `mass` is executed.
You may start training on a single device or multiple devices by assigning the options and running the command in bash:
options:

-t, --task select task: CHAR, 't' for train and 'i' for inference.
The following commands show an example of training with 2 devices.
Ascend:

```ascend
sh run_ascend.sh --task t --device_num 2 --hccl_json /{path}/rank_table.json --config /{path}/config.json
```

Note: discontinuous device IDs are not supported in `run_ascend.sh` at present; the device IDs in `rank_table.json` must start from 0.
GPU:

```gpu
sh run_gpu.sh --task t --device_num 2 --config /{path}/config.json
```
If you use a single chip, the command is as follows:
Ascend:

```ascend
sh run_ascend.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
```
GPU:

```gpu
sh run_gpu.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
```
## Weights average
```python
python weights_average.py --input_files your_checkpoint_list --output_file model.npz
```
The `input_files` argument is a list of your checkpoint files. To use `model.npz` as the weights, add its path to `config.json` at `existed_ckpt`.
```json
{
  ...
  "checkpoint_path": {
    ...
    "existed_ckpt": "/{path}/model.npz",
    ...
  },
  ...
}
```
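Conceptually, weights averaging takes the element-wise mean of each parameter across several checkpoints. Below is a minimal NumPy sketch of the idea; it is not the project's `weights_average.py`, and the checkpoint file names are placeholders:

```python
import numpy as np

def average_checkpoints(paths, output_path):
    """Average parameters with the same name across .npz checkpoints (illustrative)."""
    checkpoints = [np.load(p) for p in paths]
    names = checkpoints[0].files
    averaged = {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }
    np.savez(output_path, **averaged)

# Example usage with placeholder checkpoint paths.
average_checkpoints(["ckpt_epoch_18.npz", "ckpt_epoch_19.npz", "ckpt_epoch_20.npz"], "model.npz")
```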
## Learning rate scheduler
Two learning rate schedulers are provided in our model:

1. Polynomial decay scheduler.
2. Inverse square root scheduler.
The LR scheduler can be configured in `config/config.json`.
For the polynomial decay scheduler, the config could look like:
```json
{
  ...
}
```
For the inverse square root scheduler, the config could look like:
```json
{
  ...
}
```
More details about the LR schedulers can be found in `src/utils/lr_scheduler.py`.
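For intuition, the two schedules can be written as small Python functions. This is a conceptual sketch with assumed argument names; `src/utils/lr_scheduler.py` remains the authoritative implementation:

```python
# Conceptual sketch of the two schedules (argument names are assumptions).
def polynomial_decay_lr(step, base_lr, end_lr, decay_steps, power=1.0):
    """Decay from base_lr to end_lr over decay_steps following a polynomial curve."""
    step = min(step, decay_steps)
    frac = 1.0 - step / decay_steps
    return (base_lr - end_lr) * (frac ** power) + end_lr

def inverse_square_root_lr(step, base_lr, warmup_steps):
    """Linear warmup for warmup_steps, then decay proportionally to 1/sqrt(step)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)

for s in (100, 1000, 10000):
    print(s, polynomial_decay_lr(s, 1e-4, 1e-6, 10000), inverse_square_root_lr(s, 1e-4, 2000))
```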
# Environment Requirements
## Platform
- Hardware (Ascend/GPU)
    - Prepare the hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
MASS pre-trains a sequence to sequence model by predicting masked fragments of the input sequence. Afterwards, downstream tasks including text summarization and conversational response generation are candidates for fine-tuning the model and for inference.
Here we provide a practical example to demonstrate the basic usage of MASS for pre-training, fine-tuning, and inference. The overall process is as follows:
1. Download and process the dataset.
2. Modify `config.json` to configure the network.
3. Run a task for pre-training and fine-tuning.
4. Perform inference and validation.
## Pre-training
To pre-train a model, first configure the options in `config.json`:

- Assign the `pre_train_dataset` under the `dataset_config` node to the dataset path.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
- Assign the `ckpt_prefix` and `ckpt_path` under the `checkpoint_path` node to save the model files.
Then run `run_ascend.sh` or `run_gpu.sh` to start the task. For example, on GPU: `sh run_gpu.sh -t t -n 1 -i 1 -c /mass/config/config.json`
The log and output files are generated under the path `./train_mass_*/`, and the model file is saved under the path assigned in `config/config.json`.
## Fine-tuning
To fine-tune a model, first configure the options in `config.json`:

- Assign the `fine_tune_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_path` node to the model file generated by pre-training.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
Then run `run_ascend.sh` or `run_gpu.sh` to start the task. For example, on GPU: `sh run_gpu.sh -t t -n 1 -i 1 -c config/config.json`
The log and output files are generated under the path `./train_mass_*/`, and the model file is saved under the path assigned in `config/config.json`.
## Inference
If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, you can refer to this [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/migrate_3rd_scripts.html).
For inference, first configure the options in `config.json`:

- Assign the `test_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_path` node to the model file produced by fine-tuning.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
Then run `run_ascend.sh` or `run_gpu.sh` to start the task. For example, on GPU: `sh run_gpu.sh -t i -n 1 -i 1 -c config/config.json -o {outputfile}`
## Results
### Fine-Tuning on Text Summarization
The comparisons between MASS and two other pre-training methods in terms of ROUGE score on the text summarization task
with 3.8M training data are as follows:
| Method | RG-1(F) | RG-2(F) | RG-L(F) |
|:-------|:--------|:--------|:--------|
| MASS   | Ongoing | Ongoing | Ongoing |
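RG-1(F), RG-2(F) and RG-L(F) denote ROUGE F-scores. As a reference point for what the metric measures, here is a simplified unigram ROUGE-1 F computation; it ignores stemming, stopword handling and the other details of the official scorer:

```python
from collections import Counter

def rouge_1_f(candidate_tokens, reference_tokens):
    """Unigram overlap F-score (simplified ROUGE-1, no stemming or stopword handling)."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f("police kill the gunman".split(), "the gunman was killed by police".split()))
```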
### Fine-Tuning on Conversational Response Generation
The comparisons between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus are as follows:

| Method | Data = 10K | Data = 110K |
|:-------|:-----------|:------------|
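PPL denotes perplexity, the exponential of the average negative log-likelihood per token. A minimal illustration with made-up per-token log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Example with made-up per-token log-probabilities.
print(perplexity([-2.1, -0.7, -1.3, -3.0]))
```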
### Training Performance

| Parameters | MASS |
|:-----------|:-----|
| Speed | 611.45 sentences/s |
| Total time | --/-- |
| Params (M) | 44.6M |
| Checkpoint for Fine tuning | ---Mb, --, [A link]() |
| Model for inference | ---Mb, --, [A link]() |
| Scripts | [A link]() |
### Inference Performance
| Parameters | MASS |
|:-----------|:-----|
| Accuracy | ppl=23.52 for conversational response, RG-1=29.79 for text summarization |
| Speed | ---- sentences/s |
| Total time | --/-- |
| Model for inference | ---Mb, --, [A link]() |
# Description of random situation
The MASS model contains dropout operations. If you want to disable dropout, set the related `dropout_rate` to 0 in `config/config.json`.
# Others
The model has been validated on Ascend and GPU environments, and has not been validated on CPU.
[Paper](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf): Song, Kaitao, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu. "MASS: Masked Sequence to Sequence Pre-training for Language Generation." ICML (2019).