Huawei_Technology
/
mindspore

# Contents
- [BERT Description](#bert-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Dataset Preparation](#dataset-preparation)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [BERT Description](#contents)
The BERT network was proposed by Google in 2018. The network has made a breakthrough in the field of NLP. The network uses pre-training to achieve a large network structure without modifying, and only by adding an output layer to achieve multiple text-based tasks in fine-tuning. The backbone code of BERT adopts the Encoder structure of Transformer. The attention mechanism is introduced to enable the output layer to capture high-latitude global semantic information. The pre-training uses denoising and self-encoding tasks, namely MLM(Masked Language Model) and NSP(Next Sentence Prediction). No need to label data, pre-training can be performed on massive text data, and only a small amount of data to fine-tuning downstream tasks to obtain good results. The pre-training plus fune-tuning mode created by BERT is widely adopted by subsequent NLP networks.

[Paper](https://arxiv.org/abs/1810.04805):  Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]((https://arxiv.org/abs/1810.04805)). arXiv preprint arXiv:1810.04805. 

[Paper](https://arxiv.org/abs/1909.00204):  Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, Qun Liu. [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204). arXiv preprint arXiv:1909.00204.

# [Model Architecture](#contents)
The backbone structure of BERT is transformer. For BERT_base, the transformer contains 12 encoder modules, each module contains one self-attention module and each self-attention module contains one attention module. For BERT_NEZHA, the transformer contains 24 encoder modules, each module contains one self-attention module and each self-attention module contains one attention module. The difference between BERT_base and BERT_NEZHA is that BERT_base uses absolute position encoding to produce position embedding vector and BERT_NEZHA uses relative position encoding.  

# [Dataset](#contents)
- Download the zhwiki or enwiki dataset for pre-training. Extract and refine texts in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). Convert the dataset to TFRecord format. Please refer to create_pretraining_data.py file in [BERT](https://github.com/google-research/bert) repository.
- Download dataset for fine-tuning and evaluation such as CLUENER, TNEWS, SQuAD v1.1, etc. Convert dataset files from JSON format to TFRECORD format, please refer to run_classifier.py file in [BERT](https://github.com/google-research/bert) repository.

# [Environment Requirements](#contents)
- Hardware（Ascend）
  - Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources. 
- Framework
  - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below：
  - [MindSpore tutorials](https://www.mindspore.cn/tutorial/en/master/index.html) 
  - [MindSpore API](https://www.mindspore.cn/api/en/master/index.html)

# [Quick Start](#contents)
After installing MindSpore via the official website, you can start pre-training, fine-tuning and evaluation as follows:
```bash
# run standalone pre-training example
bash scripts/run_standalone_pretrain_ascend.sh 0 1 /path/cn-wiki-128

# run distributed pre-training example
bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.json

# run fine-tuning and evaluation example
- If you are going to run a fine-tuning task, please prepare a checkpoint generated from pre-training.
- Set bert network config and optimizer hyperparameters in `finetune_eval_config.py`. 
    
- Classification task: Set task related hyperparameters in scripts/run_classifier.sh. 
- Run `bash scripts/run_classifier.py` for fine-tuning of BERT-base and BERT-NEZHA model.

  bash scripts/run_classifier.sh
  
- NER task: Set task related hyperparameters in scripts/run_ner.sh.
- Run `bash scripts/run_ner.py` for fine-tuning of BERT-base and BERT-NEZHA model.

  bash scripts/run_ner.sh
      
- SQuAD task: Set task related hyperparameters in scripts/run_squad.sh. 
- Run `bash scripts/run_squad.py` for fine-tuning of BERT-base and BERT-NEZHA model.

  bash scripts/run_squad.sh    
```

For distributed training, an hccl configuration file with JSON format needs to be created in advance.
Please follow the instructions in the link below:
https:gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.

For dataset, if you want to set the format and parameters, a schema configuration file with JSON format needs to be created, please refer to [tfrecord](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#tfrecord) format.
```
For pretraining, schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"]. 

For ner or classification task, schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].

For squad task, training: schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], evaluation: schema file contains ["input_ids", "input_mask", "segment_ids"].

`numRows` is the only option which could be set by user, other values must be set according to the dataset.

For example, the schema file of cn-wiki-128 dataset for pretraining shows as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [128]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [128]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [128]
        },
        "next_sentence_labels": {
            "type": "int64",
            "rank": 1,
            "shape": [1]
        },
        "masked_lm_positions": {
            "type": "int64",
            "rank": 1,
            "shape": [20]
        },
        "masked_lm_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [20]
        },
        "masked_lm_weights": {
            "type": "float32",
            "rank": 1,
            "shape": [20]
        }
    }
}
``` 

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└─bert
  ├─README.md
  ├─scripts
    ├─ascend_distributed_launcher
        ├─__init__.py
        ├─hyper_parameter_config.ini          # hyper paramter for distributed pretraining 
        ├─run_distribute_pretrain.py          # script for distributed pretraining
        ├─README.md    
    ├─run_classifier.sh                       # shell script for standalone classifier task on ascend or gpu
    ├─run_ner.sh                              # shell script for standalone NER task on ascend or gpu
    ├─run_squad.sh                            # shell script for standalone SQUAD task on ascend or gpu
    ├─run_standalone_pretrain_ascend.sh       # shell script for standalone pretrain on ascend
    ├─run_distributed_pretrain_ascend.sh      # shell script for distributed pretrain on ascend
    ├─run_distributed_pretrain_gpu.sh         # shell script for distributed pretrain on gpu
    └─run_standaloned_pretrain_gpu.sh         # shell script for distributed pretrain on gpu
  ├─src
    ├─__init__.py
    ├─assessment_method.py                    # assessment method for evaluation
    ├─bert_for_finetune.py                    # backbone code of network
    ├─bert_for_pre_training.py                # backbone code of network
    ├─bert_model.py                           # backbone code of network
    ├─clue_classification_dataset_precess.py  # data preprocessing
    ├─cluner_evaluation.py                    # evaluation for cluner   
    ├─config.py                               # parameter configuration for pretraining
    ├─CRF.py                                  # assessment method for clue dataset 
    ├─dataset.py                              # data preprocessing
    ├─finetune_eval_config.py                 # parameter configuration for finetuning
    ├─finetune_eval_model.py                  # backbone code of network
    ├─sample_process.py                       # sample processing
    ├─utils.py                                # util function
  ├─pretrain_eval.py                          # train and eval net  
  ├─run_classifier.py                         # finetune and eval net for classifier task
  ├─run_ner.py                                # finetune and eval net for ner task
  ├─run_pretrain.py                           # train net for pretraining phase
  └─run_squad.py                              # finetune and eval net for squad task
```

## [Script Parameters](#contents)
### Pre-Training
``` 
usage: run_pretrain.py  [--distribute DISTRIBUTE] [--epoch_size N] [----device_num N] [--device_id N] 
                        [--enable_save_ckpt ENABLE_SAVE_CKPT] [--device_target DEVICE_TARGET]
                        [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                        [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N] 
                        [--accumulation_steps N]
                        [--save_checkpoint_path SAVE_CHECKPOINT_PATH]
                        [--load_checkpoint_path LOAD_CHECKPOINT_PATH]
                        [--save_checkpoint_steps N] [--save_checkpoint_num N] 
                        [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR] [train_steps N]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --distribute               pre_training by serveral devices: "true"(training by more than 1 device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 1
    --device_num               number of used devices: N, default is 1
    --device_id                device id: N, default is 0
    --enable_save_ckpt         enable save checkpoint: "true" | "false", default is "true"
    --enable_lossscale         enable lossscale: "true" | "false", default is "true"
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --data_sink_steps          set data sink steps: N, default is 1
    --accumulation_steps       accumulate gradients N times before weight update: N, default is 1
    --save_checkpoint_path     path to save checkpoint files: PATH, default is ""
    --load_checkpoint_path     path to load checkpoint files: PATH, default is ""
    --save_checkpoint_steps    steps for saving checkpoint files: N, default is 1000
    --save_checkpoint_num      number for saving checkpoint files: N, default is 1
    --train_steps              Training Steps: N, default is -1
    --data_dir                 path to dataset directory: PATH, default is ""
    --schema_dir               path to schema.json file, PATH, default is ""
```
### Fine-Tuning and Evaluation
```
usage: run_ner.py   [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [----do_eval DO_EVAL] 
                    [--assessment_method ASSESSMENT_METHOD] [--use_crf USE_CRF] 
                    [--device_id N] [--epoch_num N] [--vocab_file_path VOCAB_FILE_PATH]
                    [--label2id_file_path LABEL2ID_FILE_PATH] 
                    [--train_data_shuffle TRAIN_DATA_SHUFFLE] 
                    [--eval_data_shuffle EVAL_DATA_SHUFFLE] 
                    [--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
                    [--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH] 
                    [--train_data_file_path TRAIN_DATA_FILE_PATH] 
                    [--eval_data_file_path EVAL_DATA_FILE_PATH] 
                    [--schema_file_path SCHEMA_FILE_PATH]
options:
    --device_target                   device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                        whether to run training on training set: true | false
    --do_eval                         whether to run eval on dev set: true | false
    --assessment_method               assessment method to do evaluation: f1 | clue_benchmark
    --use_crf                         whether to use crf to calculate loss: true | false
    --device_id                       device id to run task
    --epoch_num                       total number of training epochs to perform
    --num_class                       number of classes to do labeling
    --train_data_shuffle              Enable train data shuffle, default is true
    --eval_data_shuffle               Enable eval data shuffle, default is true
    --vocab_file_path                 the vocabulary file that the BERT model was trained on
    --label2id_file_path              label to id json file
    --save_finetune_checkpoint_path   path to save generated finetuning checkpoint
    --load_pretrain_checkpoint_path   initial checkpoint (usually from a pre-trained BERT model)
    --load_finetune_checkpoint_path   give a finetuning checkpoint path if only do eval
    --train_data_file_path            ner tfrecord for training. E.g., train.tfrecord
    --eval_data_file_path             ner tfrecord for predictions if f1 is used to evaluate result, ner json for predictions if clue_benchmark is used to evaluate result
    --schema_file_path                path to datafile schema file

usage: run_squad.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [----do_eval DO_EVAL]                    
                    [--device_id N] [--epoch_num N] [--num_class N]
                    [--vocab_file_path VOCAB_FILE_PATH]
                    [--eval_json_path EVAL_JSON_PATH] 
                    [--train_data_shuffle TRAIN_DATA_SHUFFLE] 
                    [--eval_data_shuffle EVAL_DATA_SHUFFLE] 
                    [--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
                    [--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH] 
                    [--load_finetune_checkpoint_path LOAD_FINETUNE_CHECKPOINT_PATH] 
                    [--train_data_file_path TRAIN_DATA_FILE_PATH] 
                    [--eval_data_file_path EVAL_DATA_FILE_PATH] 
                    [--schema_file_path SCHEMA_FILE_PATH]
options:
    --device_target                   device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                        whether to run training on training set: true | false
    --do_eval                         whether to run eval on dev set: true | false
    --device_id                       device id to run task
    --epoch_num                       total number of training epochs to perform
    --num_class                       number of classes to classify, usually 2 for squad task
    --train_data_shuffle              Enable train data shuffle, default is true
    --eval_data_shuffle               Enable eval data shuffle, default is true
    --vocab_file_path                 the vocabulary file that the BERT model was trained on
    --eval_json_path                  path to squad dev json file
    --save_finetune_checkpoint_path   path to save generated finetuning checkpoint
    --load_pretrain_checkpoint_path   initial checkpoint (usually from a pre-trained BERT model)
    --load_finetune_checkpoint_path   give a finetuning checkpoint path if only do eval
    --train_data_file_path            squad tfrecord for training. E.g., train1.1.tfrecord
    --eval_data_file_path             squad tfrecord for predictions. E.g., dev1.1.tfrecord
    --schema_file_path                path to datafile schema file

usage: run_classifier.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [----do_eval DO_EVAL]                    
                         [--assessment_method ASSESSMENT_METHOD] [--device_id N] [--epoch_num N] [--num_class N]
                         [--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
                         [--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH] 
                         [--load_finetune_checkpoint_path LOAD_FINETUNE_CHECKPOINT_PATH] 
                         [--train_data_shuffle TRAIN_DATA_SHUFFLE] 
                         [--eval_data_shuffle EVAL_DATA_SHUFFLE] 
                         [--train_data_file_path TRAIN_DATA_FILE_PATH] 
                         [--eval_data_file_path EVAL_DATA_FILE_PATH] 
                         [--schema_file_path SCHEMA_FILE_PATH]
options:
    --device_target                   targeted device to run task: Ascend | GPU
    --do_train                        whether to run training on training set: true | false
    --do_eval                         whether to run eval on dev set: true | false
    --assessment_method               assessment method to do evaluation: accuracy | f1 | mcc | spearman_correlation
    --device_id                       device id to run task
    --epoch_num                       total number of training epochs to perform
    --num_class                       number of classes to do labeling
    --train_data_shuffle              Enable train data shuffle, default is true
    --eval_data_shuffle               Enable eval data shuffle, default is true
    --save_finetune_checkpoint_path   path to save generated finetuning checkpoint
    --load_pretrain_checkpoint_path   initial checkpoint (usually from a pre-trained BERT model)
    --load_finetune_checkpoint_path   give a finetuning checkpoint path if only do eval
    --train_data_file_path            tfrecord for training. E.g., train.tfrecord
    --eval_data_file_path             tfrecord for predictions. E.g., dev.tfrecord
    --schema_file_path                path to datafile schema file
```
## Options and Parameters
Parameters for training and evaluation can be set in file `config.py` and `finetune_eval_config.py` respectively.
### Options:
```
config for lossscale and etc.
    bert_network                    version of BERT model: base | nezha, default is base
    loss_scale_value                initial value of loss scale: N, default is 2^32
    scale_factor                    factor used to update loss scale: N, default is 2
    scale_window                    steps for once updatation of loss scale: N, default is 1000   
    optimizer                       optimizer used in the network: AdamWerigtDecayDynamicLR | Lamb | Momentum, default is "Lamb"
```

### Parameters:
```
Parameters for dataset and network (Pre-Training/Fine-Tuning/Evaluation):
    batch_size                      batch size of input dataset: N, default is 16
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of each embedding vector: N, must be consistant with the dataset you use. Default is 21136
    hidden_size                     size of bert encoder layers: N, default is 768
    num_hidden_layers               number of hidden layers: N, default is 12
    num_attention_heads             number of attention heads: N, default is 12
    intermediate_size               size of intermediate layer: N, default is 3072
    hidden_act                      activation function used: ACTIVATION, default is "gelu"
    hidden_dropout_prob             dropout probability for BertOutput: Q, default is 0.1
    attention_probs_dropout_prob    dropout probability for BertAttention: Q, default is 0.1
    max_position_embeddings         maximum length of sequences: N, default is 512
    type_vocab_size                 size of token type vocab: N, default is 16
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    use_relative_positions          use relative positions or not: True | False, default is False
    input_mask_from_dataset         use the input mask loaded form dataset or not: True | False, default is True
    token_type_ids_from_dataset     use the token type ids loaded from dataset or not: True | False, default is True
    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16

Parameters for optimizer:
    AdamWeightDecay:
    decay_steps                     steps of the learning rate decay: N
    learning_rate                   value of learning rate: Q
    end_learning_rate               value of end learning rate: Q, must be positive
    power                           power: Q
    warmup_steps                    steps of the learning rate warm up: N
    weight_decay                    weight decay: Q
    eps                             term added to the denominator to improve numerical stability: Q

    Lamb:
    decay_steps                     steps of the learning rate decay: N
    learning_rate                   value of learning rate: Q
    end_learning_rate               value of end learning rate: Q
    power                           power: Q
    warmup_steps                    steps of the learning rate warm up: N
    weight_decay                    weight decay: Q

    Momentum:
    learning_rate                   value of learning rate: Q
    momentum                        momentum for the moving average: Q
```

## [Training Process](#contents)
### Training
#### Running on Ascend
```
bash scripts/run_standalone_pretrain_ascend.sh 0 1 /path/cn-wiki-128
```
The command above will run in the background, you can view training logs in pretraining_log.txt. After training finished, you will get some checkpoint files under the script folder by default. The loss values will be displayed as follows:
```
# grep "epoch" pretraining_log.txt
epoch: 0.0, current epoch percent: 0.000, step: 1, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.0856101e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.000, step: 2, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.0821701e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

### Distributed Training
#### Running on Ascend
```
bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.json
```
The command above will run in the background, you can view training logs in pretraining_log.txt. After training finished, you will get some checkpoint files under the LOG* folder by default. The loss value will be displayed as follows:
```
# grep "epoch" LOG*/pretraining_log.txt
epoch: 0.0, current epoch percent: 0.001, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.08209e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.07566e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 0.0, current epoch percent: 0.001, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.08218e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.07770e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

## [Evaluation Process](#contents)
### Evaluation
#### evaluation on cola dataset when running on Ascend
Before running the command below, please check the load pretrain checkpoint path has been set. Please set the checkpoint path to be the absolute full path, e.g:"/username/pretrain/checkpoint_100_300.ckpt".
```
bash scripts/run_classifier.sh

The command above will run in the background, you can view training logs in classfier_log.txt.

If you choose accuracy as assessment method, the result will be as follows:
acc_num XXX, total_num XXX, accuracy 0.588986
```

#### evaluation on cluener dataset when running on Ascend    
```
bash scripts/ner.sh

The command above will run in the background, you can view training logs in ner_log.txt.

If you choose F1 as assessment method, the result will be as follows:
Precision 0.920507
Recall 0.948683
F1 0.920507
```
    
#### evaluation on squad v1.1 dataset when running on Ascend   
```
bash scripts/squad.sh

The command above will run in the background, you can view training logs in squad_log.txt.
The result will be as follows:
{"exact_match": 80.3878923040233284, "f1": 87.6902384023850329}
```

## [Model Description](#contents)
## [Performance](#contents)
### Pretraining Performance
| Parameters                 | Ascend                                                     | GPU                       |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version              | BERT_base                                                  | BERT_base                 |
| Resource                   | Ascend 910, cpu:2.60GHz 56cores, memory:314G               | NV SMX2 V100-32G          |
| uploaded Date              | 08/22/2020                                                 | 05/06/2020                |
| MindSpore Version          | 0.6.0                                                      | 0.3.0                     |
| Dataset                    | cn-wiki-128(4000w)                                         | ImageNet                  |
| Training Parameters        | src/config.py                                              | src/config.py             |
| Optimizer                  | Lamb                                                       | Momentum                  |
| Loss Function              | SoftmaxCrossEntropy                                        | SoftmaxCrossEntropy       |
| outputs                    | probability                                                |                           |
| Epoch                      | 40                                                         |                           |                      |
| Batch_size                 | 256*8                                                      | 130(8P)                   |                      |
| Loss                       | 1.7                                                        | 1.913                     |
| Speed                      | 340ms/step                                                 | 1.913                     |
| Total time                 | 73h                                                        |                           |
| Params (M)                 | 110M                                                       |                           |
| Checkpoint for Fine tuning | 1.2G(.ckpt file)                                           |                           |


| Parameters                 | Ascend                                                     | GPU                       |
| -------------------------- | ---------------------------------------------------------- | ------------------------- |
| Model Version              | BERT_NEZHA                                                 | BERT_NEZHA                |
| Resource                   | Ascend 910, cpu:2.60GHz 56cores, memory:314G               | NV SMX2 V100-32G          |
| uploaded Date              | 08/20/2020                                                 | 05/06/2020                |
| MindSpore Version          | 0.6.0                                                      | 0.3.0                     |
| Dataset                    | cn-wiki-128(4000w)                                         | ImageNet                  |
| Training Parameters        | src/config.py                                              | src/config.py             |
| Optimizer                  | Lamb                                                       | Momentum                  |
| Loss Function              | SoftmaxCrossEntropy                                        | SoftmaxCrossEntropy       |
| outputs                    | probability                                                |                           |
| Epoch                      | 40                                                         |                           |                      |
| Batch_size                 | 96*8                                                       | 130(8P)                   |
| Loss                       | 1.7                                                        | 1.913                     |
| Speed                      | 360ms/step                                                 | 1.913                     |
| Total time                 | 200h                                                       |                           |
| Params (M)                 | 340M                                                       |                           |
| Checkpoint for Fine tuning | 3.2G(.ckpt file)                                           |                           |             

#### Inference Performance

| Parameters                 | Ascend                        | GPU                       |
| -------------------------- | ----------------------------- | ------------------------- | 
| Model Version              |                               |                           |                      
| Resource                   | Ascend 910                    | NV SMX2 V100-32G          | 
| uploaded Date              | 08/22/2020                    | 05/22/2020                |                      
| MindSpore Version          | 0.6.0                         | 0.2.0                     | 
| Dataset                    | cola, 1.2W                    | ImageNet, 1.2W            |
| batch_size                 | 32(1P)                        | 130(8P)                   |                      
| Accuracy                   | 0.588986                      | ACC1[72.07%] ACC5[90.90%] |                     
| Speed                      | 59.25ms/step                  |                           |                     
| Total time                 | 15min                         |                           |                     
| Model for inference        | 1.2G(.ckpt file)              |                           |                     

# [Description of Random Situation](#contents)

In run_standalone_pretrain.sh and run_distributed_pretrain.sh, we set do_shuffle to True to shuffle the dataset by default. 

In run_classifier.sh, run_ner.sh and run_squad.sh, we set train_data_shuffle and eval_data_shuffle to True to shuffle the dataset by default.

In config.py, we set the hidden_dropout_prob and attention_pros_dropout_prob to 0.1 to dropout some network node by default.

In run_pretrain.py, we set a random seed to make sure that each node has the same initial weight in distribute training.

# [ModelZoo Homepage](#contents)
 
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).