# Contents

[View the Chinese version](./README_CN.md)

- [Contents](#contents)
- [BERT Description](#bert-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
        - [Pre-Training](#pre-training)
        - [Fine-Tuning and Evaluation](#fine-tuning-and-evaluation)
    - [Options and Parameters](#options-and-parameters)
        - [Options](#options)
        - [Parameters](#parameters)
    - [Training Process](#training-process)
        - [Training](#training)
            - [Running on Ascend](#running-on-ascend)
            - [Running on GPU](#running-on-gpu)
        - [Distributed Training](#distributed-training)
            - [Running on Ascend](#running-on-ascend-1)
            - [Running on GPU](#running-on-gpu-1)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
            - [Evaluation on CoLA dataset when running on Ascend](#evaluation-on-cola-dataset-when-running-on-ascend)
            - [Evaluation on CLUENER dataset when running on Ascend](#evaluation-on-cluener-dataset-when-running-on-ascend)
            - [Evaluation on MSRA dataset when running on Ascend](#evaluation-on-msra-dataset-when-running-on-ascend)
            - [Evaluation on SQuAD v1.1 dataset when running on Ascend](#evaluation-on-squad-v11-dataset-when-running-on-ascend)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Pretraining Performance](#pretraining-performance)
        - [Inference Performance](#inference-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [BERT Description](#contents)

The BERT network was proposed by Google in 2018 and achieved a breakthrough in the field of NLP. Through pre-training, the large network structure can be reused without modification: multiple text-based tasks are handled during fine-tuning simply by adding an output layer. The backbone of BERT adopts the Encoder structure of the Transformer, and the attention mechanism enables the output layer to capture global semantic information. Pre-training uses denoising autoencoding objectives, namely MLM (Masked Language Model) and NSP (Next Sentence Prediction). Since no labeled data is required, pre-training can be performed on massive text corpora, and only a small amount of data is needed to fine-tune downstream tasks and obtain good results. The pre-training plus fine-tuning paradigm created by BERT has been widely adopted by subsequent NLP networks.

[Paper](https://arxiv.org/abs/1810.04805): Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805). arXiv preprint arXiv:1810.04805.

[Paper](https://arxiv.org/abs/1909.00204): Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, Qun Liu. [NEZHA: Neural Contextualized Representation for Chinese Language Understanding](https://arxiv.org/abs/1909.00204). arXiv preprint arXiv:1909.00204.
# [Model Architecture](#contents)

The backbone structure of BERT is the Transformer. For BERT_base, the Transformer contains 12 encoder modules; each module contains one self-attention module, and each self-attention module contains one attention module. For BERT_NEZHA, the Transformer contains 24 encoder modules with the same internal structure. The difference between BERT_base and BERT_NEZHA is that BERT_base uses absolute position encoding to produce the position embedding vectors, while BERT_NEZHA uses relative position encoding.
# [Dataset](#contents)

- Create the pre-training dataset
    - Download the [zhwiki](https://dumps.wikimedia.org/zhwiki/) or [enwiki](https://dumps.wikimedia.org/enwiki/) dump for pre-training.
    - Extract and refine the texts in the dump with [WikiExtractor](https://github.com/attardi/wikiextractor). The commands are as follows:
        - `pip install wikiextractor`
        - `python -m wikiextractor.WikiExtractor -o <output file path> -b <output file size> <Wikipedia dump file>`
    - Convert the dataset to TFRecord format (see the example command after this list). Please refer to the create_pretraining_data.py file in the [BERT](https://github.com/google-research/bert) repository and download vocab.txt there. If `AttributeError: module 'tokenization' has no attribute 'FullTokenizer'` occurs, please install bert-tensorflow.
- Create the fine-tuning dataset
    - Download datasets for fine-tuning and evaluation, such as [CLUENER](https://github.com/CLUEbenchmark/CLUENER2020), [TNEWS](https://github.com/CLUEbenchmark/CLUE), the [SQuAD v1.1 train dataset](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json), the [SQuAD v1.1 eval dataset](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json), etc.
    - Convert the dataset files from JSON format to TFRecord format. Please refer to the run_classifier.py file in the [BERT](https://github.com/google-research/bert) repository.
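The following is a hedged example of the TFRecord conversion step for pre-training data. The flag names are taken from create_pretraining_data.py in the upstream google-research/bert repository and may differ between versions; all paths are placeholders. `--max_seq_length=128` and `--max_predictions_per_seq=20` match the schema example shown later in this README.

```bash
# Illustrative only: convert extracted wiki text to TFRecord with the upstream
# BERT repository's create_pretraining_data.py (verify the flags against the version you use).
python create_pretraining_data.py \
    --input_file=/path/extracted_wiki_text.txt \
    --output_file=/path/cn-wiki-128/wiki_00.tfrecord \
    --vocab_file=/path/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5
```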
# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with Ascend or GPU processors.
- Framework
    - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Quick Start](#contents)

After installing MindSpore via the official website, you can start pre-training, fine-tuning, and evaluation as follows:

- Running on Ascend

```bash
# run standalone pre-training example
bash scripts/run_standalone_pretrain_ascend.sh 0 1 /path/cn-wiki-128

# run distributed pre-training example
bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.json

# run fine-tuning and evaluation examples
# - If you are going to run a fine-tuning task, please prepare a checkpoint generated from pre-training.
# - Set the bert network config and optimizer hyperparameters in `finetune_eval_config.py`.

# - Classification task: set task-related hyperparameters in scripts/run_classifier.sh,
#   then run the script below to fine-tune and evaluate a BERT-base or BERT-NEZHA model.
bash scripts/run_classifier.sh

# - NER task: set task-related hyperparameters in scripts/run_ner.sh,
#   then run the script below to fine-tune and evaluate a BERT-base or BERT-NEZHA model.
bash scripts/run_ner.sh

# - SQuAD task: set task-related hyperparameters in scripts/run_squad.sh,
#   then run the script below to fine-tune and evaluate a BERT-base or BERT-NEZHA model.
bash scripts/run_squad.sh
```
- Running on GPU

```bash
# run standalone pre-training example
bash scripts/run_standalone_pretrain_for_gpu.sh 0 1 /path/cn-wiki-128

# run distributed pre-training example
bash scripts/run_distributed_pretrain_for_gpu.sh 8 40 /path/cn-wiki-128

# run fine-tuning and evaluation examples
# - If you are going to run a fine-tuning task, please prepare a checkpoint generated from pre-training.
# - Set the bert network config and optimizer hyperparameters in `finetune_eval_config.py`.

# - Classification task: set task-related hyperparameters in scripts/run_classifier.sh,
#   then run the script below to fine-tune and evaluate a BERT-base or BERT-NEZHA model.
bash scripts/run_classifier.sh

# - NER task: set task-related hyperparameters in scripts/run_ner.sh,
#   then run the script below to fine-tune and evaluate a BERT-base or BERT-NEZHA model.
bash scripts/run_ner.sh

# - SQuAD task: set task-related hyperparameters in scripts/run_squad.sh,
#   then run the script below to fine-tune and evaluate a BERT-base or BERT-NEZHA model.
bash scripts/run_squad.sh
```
For distributed training on Ascend, an HCCL configuration file in JSON format needs to be created in advance.

For distributed training on a single machine, [here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_single_machine_multi_rank.json) is an example hccl.json.

For distributed training across multiple machines, the training command should be executed on each machine within a short time interval, so an hccl.json is needed on each machine. [Here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_multi_machine_multi_rank.json) is an example hccl.json for the multi-machine case.

Please follow the instructions in the link below to create the hccl.json file you need:
[https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
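As a hedged illustration, generating an hccl.json for eight devices on a single machine with the hccl_tools script linked above might look like the command below. The `--device_num` argument format is an assumption based on the hccl_tools documentation; verify it against the version you download.

```bash
# Illustrative only: generate an hccl.json covering devices 0-7 on the local machine.
python hccl_tools.py --device_num "[0,8)"
```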
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/doc/programming_guide/zh-CN/master/dataset_loading.html#tfrecord) format.
```text
For pretraining, schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].

For ner or classification task, schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].

For squad task, training: schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], evaluation: schema file contains ["input_ids", "input_mask", "segment_ids"].

`numRows` is the only option which could be set by user, other values must be set according to the dataset.

For example, the schema file of cn-wiki-128 dataset for pretraining is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [128]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [128]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [128]
        },
        "next_sentence_labels": {
            "type": "int64",
            "rank": 1,
            "shape": [1]
        },
        "masked_lm_positions": {
            "type": "int64",
            "rank": 1,
            "shape": [20]
        },
        "masked_lm_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [20]
        },
        "masked_lm_weights": {
            "type": "float32",
            "rank": 1,
            "shape": [20]
        }
    }
}
```
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└─bert
  ├─README.md
  ├─scripts
  │ ├─ascend_distributed_launcher
  │ │ ├─__init__.py
  │ │ ├─hyper_parameter_config.ini          # hyper parameters for distributed pre-training
  │ │ ├─get_distribute_pretrain_cmd.py      # script for distributed pre-training
  │ │ └─README.md
  │ ├─run_classifier.sh                     # shell script for standalone classifier task on Ascend or GPU
  │ ├─run_ner.sh                            # shell script for standalone NER task on Ascend or GPU
  │ ├─run_squad.sh                          # shell script for standalone SQuAD task on Ascend or GPU
  │ ├─run_standalone_pretrain_ascend.sh     # shell script for standalone pre-training on Ascend
  │ ├─run_distributed_pretrain_ascend.sh    # shell script for distributed pre-training on Ascend
  │ ├─run_distributed_pretrain_gpu.sh       # shell script for distributed pre-training on GPU
  │ └─run_standaloned_pretrain_gpu.sh       # shell script for standalone pre-training on GPU
  ├─src
  │ ├─__init__.py
  │ ├─assessment_method.py                  # assessment methods for evaluation
  │ ├─bert_for_finetune.py                  # backbone code of network
  │ ├─bert_for_pre_training.py              # backbone code of network
  │ ├─bert_model.py                         # backbone code of network
  │ ├─finetune_data_preprocess.py           # data preprocessing
  │ ├─cluner_evaluation.py                  # evaluation for CLUENER
  │ ├─config.py                             # parameter configuration for pre-training
  │ ├─CRF.py                                # assessment method for CLUE dataset
  │ ├─dataset.py                            # data preprocessing
  │ ├─finetune_eval_config.py               # parameter configuration for fine-tuning
  │ ├─finetune_eval_model.py                # backbone code of network
  │ ├─sample_process.py                     # sample processing
  │ └─utils.py                              # util functions
  ├─pretrain_eval.py                        # train and eval net
  ├─run_classifier.py                       # finetune and eval net for classifier task
  ├─run_ner.py                              # finetune and eval net for NER task
  ├─run_pretrain.py                         # train net for pre-training phase
  └─run_squad.py                            # finetune and eval net for SQuAD task
```
## [Script Parameters](#contents)

### Pre-Training

```text
usage: run_pretrain.py  [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                        [--enable_save_ckpt ENABLE_SAVE_CKPT] [--device_target DEVICE_TARGET]
                        [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                        [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                        [--accumulation_steps N]
                        [--allreduce_post_accumulation ALLREDUCE_POST_ACCUMULATION]
                        [--save_checkpoint_path SAVE_CHECKPOINT_PATH]
                        [--load_checkpoint_path LOAD_CHECKPOINT_PATH]
                        [--save_checkpoint_steps N] [--save_checkpoint_num N]
                        [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR] [--train_steps N]

options:
    --device_target                device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --distribute                   pre-training by several devices: "true" (training by more than 1 device) | "false", default is "false"
    --epoch_size                   epoch size: N, default is 1
    --device_num                   number of used devices: N, default is 1
    --device_id                    device id: N, default is 0
    --enable_save_ckpt             enable saving checkpoints: "true" | "false", default is "true"
    --enable_lossscale             enable loss scale: "true" | "false", default is "true"
    --do_shuffle                   enable shuffle: "true" | "false", default is "true"
    --enable_data_sink             enable data sink: "true" | "false", default is "true"
    --data_sink_steps              set data sink steps: N, default is 1
    --accumulation_steps           accumulate gradients N times before weight update: N, default is 1
    --allreduce_post_accumulation  allreduce after accumulation of N steps or after each step: "true" | "false", default is "true"
    --save_checkpoint_path         path to save checkpoint files: PATH, default is ""
    --load_checkpoint_path         path to load checkpoint files: PATH, default is ""
    --save_checkpoint_steps        steps for saving checkpoint files: N, default is 1000
    --save_checkpoint_num          number of checkpoint files to keep: N, default is 1
    --train_steps                  training steps: N, default is -1
    --data_dir                     path to dataset directory: PATH, default is ""
    --schema_dir                   path to schema.json file: PATH, default is ""
```
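Below is a hedged example of a standalone pre-training invocation built only from the options listed above; in practice the wrapper script `scripts/run_standalone_pretrain_ascend.sh` issues a similar command, and all paths here are placeholders.

```bash
# Illustrative only: standalone pre-training on a single Ascend device.
python run_pretrain.py \
    --device_target="Ascend" \
    --distribute="false" \
    --epoch_size=1 \
    --device_id=0 \
    --enable_save_ckpt="true" \
    --enable_lossscale="true" \
    --do_shuffle="true" \
    --enable_data_sink="true" \
    --data_sink_steps=1 \
    --save_checkpoint_steps=1000 \
    --save_checkpoint_num=1 \
    --data_dir=/path/cn-wiki-128 \
    --schema_dir=/path/schema.json
```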
### Fine-Tuning and Evaluation

```text
usage: run_ner.py   [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
                    [--assessment_method ASSESSMENT_METHOD] [--use_crf USE_CRF]
                    [--device_id N] [--epoch_num N] [--vocab_file_path VOCAB_FILE_PATH]
                    [--label2id_file_path LABEL2ID_FILE_PATH]
                    [--train_data_shuffle TRAIN_DATA_SHUFFLE]
                    [--eval_data_shuffle EVAL_DATA_SHUFFLE]
                    [--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
                    [--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH]
                    [--train_data_file_path TRAIN_DATA_FILE_PATH]
                    [--eval_data_file_path EVAL_DATA_FILE_PATH]
                    [--schema_file_path SCHEMA_FILE_PATH]

options:
    --device_target                   device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                        whether to run training on the training set: true | false
    --do_eval                         whether to run evaluation on the dev set: true | false
    --assessment_method               assessment method for evaluation: f1 | clue_benchmark
    --use_crf                         whether to use CRF to calculate loss: true | false
    --device_id                       device id to run the task
    --epoch_num                       total number of training epochs to perform
    --train_data_shuffle              enable train data shuffle, default is true
    --eval_data_shuffle               enable eval data shuffle, default is true
    --vocab_file_path                 the vocabulary file that the BERT model was trained on
    --label2id_file_path              label-to-id file; each label name must be consistent with the type names labeled in the original dataset file
    --save_finetune_checkpoint_path   path to save the generated fine-tuning checkpoint
    --load_pretrain_checkpoint_path   initial checkpoint (usually from a pre-trained BERT model)
    --load_finetune_checkpoint_path   give a fine-tuning checkpoint path if you only do evaluation
    --train_data_file_path            NER tfrecord for training, e.g., train.tfrecord
    --eval_data_file_path             NER tfrecord for predictions if f1 is used to evaluate the result; NER json for predictions if clue_benchmark is used to evaluate the result
    --dataset_format                  dataset format, supports mindrecord or tfrecord
    --schema_file_path                path to the datafile schema file

usage: run_squad.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
                    [--device_id N] [--epoch_num N] [--num_class N]
                    [--vocab_file_path VOCAB_FILE_PATH]
                    [--eval_json_path EVAL_JSON_PATH]
                    [--train_data_shuffle TRAIN_DATA_SHUFFLE]
                    [--eval_data_shuffle EVAL_DATA_SHUFFLE]
                    [--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
                    [--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH]
                    [--load_finetune_checkpoint_path LOAD_FINETUNE_CHECKPOINT_PATH]
                    [--train_data_file_path TRAIN_DATA_FILE_PATH]
                    [--eval_data_file_path EVAL_DATA_FILE_PATH]
                    [--schema_file_path SCHEMA_FILE_PATH]

options:
    --device_target                   device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                        whether to run training on the training set: true | false
    --do_eval                         whether to run evaluation on the dev set: true | false
    --device_id                       device id to run the task
    --epoch_num                       total number of training epochs to perform
    --num_class                       number of classes to classify, usually 2 for the SQuAD task
    --train_data_shuffle              enable train data shuffle, default is true
    --eval_data_shuffle               enable eval data shuffle, default is true
    --vocab_file_path                 the vocabulary file that the BERT model was trained on
    --eval_json_path                  path to the SQuAD dev json file
    --save_finetune_checkpoint_path   path to save the generated fine-tuning checkpoint
    --load_pretrain_checkpoint_path   initial checkpoint (usually from a pre-trained BERT model)
    --load_finetune_checkpoint_path   give a fine-tuning checkpoint path if you only do evaluation
    --train_data_file_path            SQuAD tfrecord for training, e.g., train1.1.tfrecord
    --eval_data_file_path             SQuAD tfrecord for predictions, e.g., dev1.1.tfrecord
    --schema_file_path                path to the datafile schema file

usage: run_classifier.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
                         [--assessment_method ASSESSMENT_METHOD] [--device_id N] [--epoch_num N] [--num_class N]
                         [--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
                         [--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH]
                         [--load_finetune_checkpoint_path LOAD_FINETUNE_CHECKPOINT_PATH]
                         [--train_data_shuffle TRAIN_DATA_SHUFFLE]
                         [--eval_data_shuffle EVAL_DATA_SHUFFLE]
                         [--train_data_file_path TRAIN_DATA_FILE_PATH]
                         [--eval_data_file_path EVAL_DATA_FILE_PATH]
                         [--schema_file_path SCHEMA_FILE_PATH]

options:
    --device_target                   targeted device to run the task: Ascend | GPU
    --do_train                        whether to run training on the training set: true | false
    --do_eval                         whether to run evaluation on the dev set: true | false
    --assessment_method               assessment method for evaluation: accuracy | f1 | mcc | spearman_correlation
    --device_id                       device id to run the task
    --epoch_num                       total number of training epochs to perform
    --num_class                       number of classes to label
    --train_data_shuffle              enable train data shuffle, default is true
    --eval_data_shuffle               enable eval data shuffle, default is true
    --save_finetune_checkpoint_path   path to save the generated fine-tuning checkpoint
    --load_pretrain_checkpoint_path   initial checkpoint (usually from a pre-trained BERT model)
    --load_finetune_checkpoint_path   give a fine-tuning checkpoint path if you only do evaluation
    --train_data_file_path            tfrecord for training, e.g., train.tfrecord
    --eval_data_file_path             tfrecord for predictions, e.g., dev.tfrecord
    --schema_file_path                path to the datafile schema file
```
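As a hedged example of how these options fit together, the command below sketches a fine-tune-then-evaluate run of run_classifier.py using only the flags documented above. The wrapper script `scripts/run_classifier.sh` issues a similar command; every path, the epoch count, and the class count are placeholders to adapt to your task.

```bash
# Illustrative only: fine-tune and evaluate a classifier on Ascend device 0.
python run_classifier.py \
    --device_target="Ascend" \
    --do_train="true" \
    --do_eval="true" \
    --assessment_method="accuracy" \
    --device_id=0 \
    --epoch_num=3 \
    --num_class=2 \
    --save_finetune_checkpoint_path=/path/finetune_ckpt \
    --load_pretrain_checkpoint_path=/path/pretrain/checkpoint_100_300.ckpt \
    --train_data_file_path=/path/train.tfrecord \
    --eval_data_file_path=/path/dev.tfrecord \
    --schema_file_path=/path/schema.json
```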
## Options and Parameters

Parameters for training and evaluation can be set in `config.py` and `finetune_eval_config.py` respectively.

### Options

```text
config for loss scale and other settings.
    bert_network        version of BERT model: base | nezha, default is base
    batch_size          batch size of input dataset: N, default is 16
    loss_scale_value    initial value of loss scale: N, default is 2^32
    scale_factor        factor used to update loss scale: N, default is 2
    scale_window        steps for one update of loss scale: N, default is 1000
    optimizer           optimizer used in the network: AdamWeightDecay | Lamb | Momentum, default is "Lamb"
```
### Parameters

```text
Parameters for dataset and network (Pre-Training/Fine-Tuning/Evaluation):
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of the vocabulary: N, must be consistent with the dataset you use, default is 21128.
                                    Usually, we use 21128 for CN vocabs and 30522 for EN vocabs according to the original paper.
    hidden_size                     size of bert encoder layers: N, default is 768
    num_hidden_layers               number of hidden layers: N, default is 12
    num_attention_heads             number of attention heads: N, default is 12
    intermediate_size               size of intermediate layer: N, default is 3072
    hidden_act                      activation function used: ACTIVATION, default is "gelu"
    hidden_dropout_prob             dropout probability for BertOutput: Q, default is 0.1
    attention_probs_dropout_prob    dropout probability for BertAttention: Q, default is 0.1
    max_position_embeddings         maximum length of sequences: N, default is 512
    type_vocab_size                 size of token type vocab: N, default is 16
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    use_relative_positions          use relative positions or not: True | False, default is False
    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16

Parameters for optimizer:
    AdamWeightDecay:
        decay_steps                 steps of the learning rate decay: N
        learning_rate               value of learning rate: Q
        end_learning_rate           value of end learning rate: Q, must be positive
        power                       power: Q
        warmup_steps                steps of the learning rate warm up: N
        weight_decay                weight decay: Q
        eps                         term added to the denominator to improve numerical stability: Q

    Lamb:
        decay_steps                 steps of the learning rate decay: N
        learning_rate               value of learning rate: Q
        end_learning_rate           value of end learning rate: Q
        power                       power: Q
        warmup_steps                steps of the learning rate warm up: N
        weight_decay                weight decay: Q

    Momentum:
        learning_rate               value of learning rate: Q
        momentum                    momentum for the moving average: Q
```
## [Training Process](#contents)

### Training

#### Running on Ascend

```bash
bash scripts/run_standalone_pretrain_ascend.sh 0 1 /path/cn-wiki-128
```

The command above will run in the background; you can view the training logs in pretraining_log.txt. After training finishes, you will get some checkpoint files under the script folder by default. The loss values will be displayed as follows:

```text
# grep "epoch" pretraining_log.txt
epoch: 0.0, current epoch percent: 0.000, step: 1, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.0856101e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.000, step: 2, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.0821701e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```
#### Running on GPU

```bash
bash scripts/run_standalone_pretrain_for_gpu.sh 0 1 /path/cn-wiki-128
```

The command above will run in the background; you can view the results in the file pretraining_log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will be as follows:

```bash
# grep "epoch" pretraining_log.txt
epoch: 0.0, current epoch percent: 0.000, step: 1, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.0856101e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.000, step: 2, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.0821701e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

> **Attention** If you are running with a huge dataset on Ascend, it is better to set an extra environment variable to make sure HCCL does not time out:
>
> ```bash
> export HCCL_CONNECT_TIMEOUT=600
> ```
>
> This extends the HCCL timeout from the default 120 seconds to 600 seconds.

> **Attention** If you are running with a big BERT model, a protobuf error may occur while saving checkpoints; in that case, try setting the following environment variable:
>
> ```bash
> export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
> ```
### Distributed Training

#### Running on Ascend

```bash
bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.json
```

The command above will run in the background; you can view the training logs in pretraining_log.txt. After training finishes, you will get some checkpoint files under the LOG* folder by default. The loss values will be displayed as follows:

```bash
# grep "epoch" LOG*/pretraining_log.txt
epoch: 0.0, current epoch percent: 0.001, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.08209e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.07566e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 0.0, current epoch percent: 0.001, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.08218e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.07770e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```
#### Running on GPU

```bash
bash scripts/run_distributed_pretrain_for_gpu.sh /path/cn-wiki-128
```

The command above will run in the background; you can view the results in the file pretraining_log.txt. After training, you will get some checkpoint files under the LOG* folder by default. The loss values will be as follows:

```bash
# grep "epoch" LOG*/pretraining_log.txt
epoch: 0.0, current epoch percent: 0.001, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.08209e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.07566e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 0.0, current epoch percent: 0.001, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.08218e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outputs are (Tensor(shape=[1], dtype=Float32, [ 1.07770e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

> **Attention** The distributed launcher binds processor cores according to `device_num` and the total number of processor cores. If you do not want pretraining to bind processor cores, remove the `taskset`-related operations in `scripts/ascend_distributed_launcher/get_distribute_pretrain_cmd.py`.
## [Evaluation Process](#contents)

### Evaluation

#### Evaluation on CoLA dataset when running on Ascend

Before running the command below, please make sure the path for loading the pre-trained checkpoint has been set. Please set it to an absolute path, e.g., "/username/pretrain/checkpoint_100_300.ckpt".

```bash
bash scripts/run_classifier.sh
```

The command above will run in the background; you can view the training logs in classifier_log.txt.

If you choose accuracy as the assessment method, the result will be as follows:

```text
acc_num XXX, total_num XXX, accuracy 0.588986
```
#### Evaluation on CLUENER dataset when running on Ascend

```bash
bash scripts/run_ner.sh
```

The command above will run in the background; you can view the training logs in ner_log.txt.

If you choose F1 as the assessment method, the result will be as follows:

```text
Precision 0.920507
Recall 0.948683
F1 0.920507
```
#### Evaluation on MSRA dataset when running on Ascend

For preprocessing, you can first convert the original txt format of the MSRA dataset into MindRecord by running the command below (please keep in mind that the label names in the label2id file should be consistent with the type names labeled in the original msra_dataset.xml dataset file):

```bash
python src/finetune_data_preprocess.py --data_dir=/path/msra_dataset.xml --vocab_file=/path/vocab_file --save_path=/path/msra_dataset.mindrecord --label2id=/path/label2id_file --max_seq_len=seq_len --class_filter="NAMEX" --split_begin=0.0 --split_end=1.0
```

For fine-tuning and evaluation, just run

```bash
bash scripts/run_ner.sh
```

The command above will run in the background; you can view the training logs in ner_log.txt.

If you choose MF1 (F1 score with multiple labels) as the assessment method, the result after fine-tuning for 10 epochs will be as follows:

```text
F1 0.931243
```
#### Evaluation on SQuAD v1.1 dataset when running on Ascend

```bash
bash scripts/run_squad.sh
```

The command above will run in the background; you can view the training logs in squad_log.txt.

The result will be as follows:

```text
{"exact_match": 80.3878923040233284, "f1": 87.6902384023850329}
```
# [Model Description](#contents)

## [Performance](#contents)

### Pretraining Performance

| Parameters                 | Ascend                                                       | GPU                       |
| -------------------------- | ------------------------------------------------------------ | ------------------------- |
| Model Version              | BERT_base                                                    | BERT_base                 |
| Resource                   | Ascend 910; CPU 2.60GHz, 192 cores; memory 755G; OS Euler2.8 | NV SMX2 V100-16G; CPU: Intel(R) Xeon(R) Platinum 8160 @ 2.10GHz; memory: 256G |
| Uploaded Date              | 08/22/2020                                                   | 05/06/2020                |
| MindSpore Version          | 1.0.0                                                        | 1.0.0                     |
| Dataset                    | cn-wiki-128 (40M)                                            | cn-wiki-128 (40M)         |
| Training Parameters        | src/config.py                                                | src/config.py             |
| Optimizer                  | Lamb                                                         | AdamWeightDecay           |
| Loss Function              | SoftmaxCrossEntropy                                          | SoftmaxCrossEntropy       |
| Outputs                    | probability                                                  | probability               |
| Epoch                      | 40                                                           | 40                        |
| Batch_size                 | 256*8                                                        | 32*8                      |
| Loss                       | 1.7                                                          | 1.7                       |
| Speed                      | 340 ms/step                                                  | 290 ms/step               |
| Total time                 | 73 h                                                         | 610 h                     |
| Params (M)                 | 110M                                                         | 110M                      |
| Checkpoint for Fine tuning | 1.2G (.ckpt file)                                            | 1.2G (.ckpt file)         |
| Scripts                    | [BERT_base](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/bert) | [BERT_base](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/bert) |

| Parameters                 | Ascend                                                       |
| -------------------------- | ------------------------------------------------------------ |
| Model Version              | BERT_NEZHA                                                   |
| Resource                   | Ascend 910; CPU 2.60GHz, 192 cores; memory 755G; OS Euler2.8 |
| Uploaded Date              | 08/20/2020                                                   |
| MindSpore Version          | 1.0.0                                                        |
| Dataset                    | cn-wiki-128 (40M)                                            |
| Training Parameters        | src/config.py                                                |
| Optimizer                  | Lamb                                                         |
| Loss Function              | SoftmaxCrossEntropy                                          |
| Outputs                    | probability                                                  |
| Epoch                      | 40                                                           |
| Batch_size                 | 96*8                                                         |
| Loss                       | 1.7                                                          |
| Speed                      | 360 ms/step                                                  |
| Total time                 | 200 h                                                        |
| Params (M)                 | 340M                                                         |
| Checkpoint for Fine tuning | 3.2G (.ckpt file)                                            |
| Scripts                    | [BERT_NEZHA](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/bert) |
### Inference Performance

| Parameters          | Ascend                  |
| ------------------- | ----------------------- |
| Model Version       |                         |
| Resource            | Ascend 910; OS Euler2.8 |
| Uploaded Date       | 08/22/2020              |
| MindSpore Version   | 1.0.0                   |
| Dataset             | CoLA, 12k               |
| batch_size          | 32 (1P)                 |
| Accuracy            | 0.588986                |
| Speed               | 59.25 ms/step           |
| Total time          | 15 min                  |
| Model for inference | 1.2G (.ckpt file)       |
# [Description of Random Situation](#contents)

In run_standalone_pretrain.sh and run_distributed_pretrain.sh, we set do_shuffle to True to shuffle the dataset by default.

In run_classifier.sh, run_ner.sh and run_squad.sh, we set train_data_shuffle and eval_data_shuffle to True to shuffle the dataset by default.

In config.py, we set hidden_dropout_prob and attention_probs_dropout_prob to 0.1 to randomly drop some network nodes by default.

In run_pretrain.py, we set a random seed to make sure that each node has the same initial weights in distributed training.

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).