# Contents

- [TinyBERT Description](#tinybert-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Dataset Preparation](#dataset-preparation)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [TinyBERT Description](#contents)

[TinyBERT](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) is 7.5x smaller and 9.4x faster on inference than [BERT-base](https://github.com/google-research/bert) (the base version of the BERT model) and achieves competitive performance on natural language understanding tasks. It performs a novel transformer distillation at both the pre-training and task-specific learning stages.

[Paper](https://arxiv.org/abs/1909.10351): Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. [TinyBERT: Distilling BERT for Natural Language Understanding](https://arxiv.org/abs/1909.10351). arXiv preprint arXiv:1909.10351.

# [Model Architecture](#contents)

The backbone of TinyBERT is the transformer. The network contains four encoder modules; each encoder contains a self-attention module, and each self-attention module contains an attention module.

# [Dataset](#contents)

- Download the zhwiki or enwiki dataset for general distillation. Extract and clean the text in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). To convert the dataset to TFRecord format, please refer to create_pretraining_data.py in the [BERT](https://github.com/google-research/bert) repository (see the sketch below).
- Download the GLUE dataset for task distillation. To convert the dataset files from JSON format to TFRecord format, please refer to run_classifier.py in the [BERT](https://github.com/google-research/bert) repository.

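As a rough sketch of the general-distillation data preparation (all paths, file names and the vocab file below are placeholders; the flags follow the upstream WikiExtractor and create_pretraining_data.py scripts, and additional sentence splitting/cleaning of the WikiExtractor output may be needed before conversion):

```bash
# Extract and clean text from a wiki dump (input/output paths are placeholders)
python WikiExtractor.py enwiki-latest-pages-articles.xml.bz2 -o ./extracted

# Convert the cleaned text to TFRecord with BERT's create_pretraining_data.py
python create_pretraining_data.py \
  --input_file=./extracted/wiki_cleaned.txt \
  --output_file=./tfrecord/wiki.tfrecord \
  --vocab_file=./vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5
```
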
# [Environment Requirements](#contents)

- Hardware (Ascend)
    - Prepare a hardware environment with an Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/en/master/index.html)
    - [MindSpore API](https://www.mindspore.cn/api/en/master/index.html)

# [Quick Start](#contents)

After installing MindSpore via the official website, you can start general distill, task distill and evaluation as follows:

```bash
# run standalone general distill example
bash scripts/run_standalone_gd_ascend.sh
Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir` and `schema_dir` in the run_standalone_gd_ascend.sh file first.

# run distributed general distill example
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json
Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir` and `schema_dir` in the run_distributed_gd_ascend.sh file first.

# run task distill and evaluation example
bash scripts/run_standalone_td_ascend.sh
Before running the shell script, please set the `task_name`, `load_teacher_ckpt_path`, `load_gd_ckpt_path`, `train_data_dir`, `eval_data_dir` and `schema_dir` in the run_standalone_td_ascend.sh file first.
```

For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.

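For illustration only, a hypothetical invocation of the hccl_tools utility linked above (the name and location of the generated JSON file depend on your environment):

```bash
# Generate an HCCL configuration file covering devices 0-7, then pass it to the distributed script
python hccl_tools.py --device_num "[0,8)"
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/to/generated_hccl.json
```
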
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└─bert
  ├─README.md
  ├─scripts
    ├─run_distributed_gd_ascend.sh       # shell script for distributed general distill phase on Ascend
    ├─run_distributed_gd_for_gpu.sh      # shell script for distributed general distill phase on GPU
    ├─run_standalone_gd_ascend.sh        # shell script for standalone general distill phase
    ├─run_standalone_td_ascend.sh        # shell script for standalone task distill phase
  ├─src
    ├─__init__.py
    ├─assessment_method.py               # assessment methods for evaluation
    ├─dataset.py                         # data processing
    ├─fused_layer_norm.py                # layer normalization optimized for Ascend
    ├─gd_config.py                       # parameter configuration for general distill phase
    ├─td_config.py                       # parameter configuration for task distill phase
    ├─tinybert_for_gd_td.py              # backbone code of the network
    ├─tinybert_model.py                  # backbone code of the network
    ├─utils.py                           # util functions
  ├─__init__.py
  ├─run_general_distill.py               # train net for general distillation
  ├─run_task_distill.py                  # train and eval net for task distillation
```

## [Script Parameters](#contents)

### General Distill

```
usage: run_general_distill.py   [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                                [--device_target DEVICE_TARGET] [--do_shuffle DO_SHUFFLE]
                                [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                                [--save_ckpt_path SAVE_CKPT_PATH]
                                [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                                [--save_checkpoint_step N] [--max_ckpt_num N]
                                [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR] [--train_steps N]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --distribute               pre-training on several devices: "true" (training on more than 1 device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 1
    --device_id                device id: N, default is 0
    --device_num               number of used devices: N, default is 1
    --save_ckpt_path           path to save checkpoint files: PATH, default is ""
    --max_ckpt_num             max number of checkpoint files to keep: N, default is 1
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --data_sink_steps          set data sink steps: N, default is 1
    --save_checkpoint_step     steps between saving checkpoint files: N, default is 1000
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --data_dir                 path to dataset directory: PATH, default is ""
    --schema_dir               path to schema.json file: PATH, default is ""
```

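For illustration only, a minimal sketch of a standalone invocation using the options above (all paths are placeholders; the recommended entry point remains the shell script under `scripts/`):

```bash
python run_general_distill.py \
  --distribute="false" \
  --device_target="Ascend" \
  --device_id=0 \
  --epoch_size=3 \
  --save_ckpt_path=/path/to/save_ckpt/ \
  --load_teacher_ckpt_path=/path/to/teacher_bert_base.ckpt \
  --data_dir=/path/to/wiki_tfrecord/ \
  --schema_dir=/path/to/schema.json
```
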
### Task Distill

```
usage: run_task_distill.py  [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
                            [--td_phase1_epoch_size N] [--td_phase2_epoch_size N]
                            [--device_id N] [--do_shuffle DO_SHUFFLE]
                            [--enable_data_sink ENABLE_DATA_SINK] [--save_ckpt_step N]
                            [--max_ckpt_num N] [--data_sink_steps N]
                            [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                            [--load_gd_ckpt_path LOAD_GD_CKPT_PATH]
                            [--load_td1_ckpt_path LOAD_TD1_CKPT_PATH]
                            [--train_data_dir TRAIN_DATA_DIR]
                            [--eval_data_dir EVAL_DATA_DIR]
                            [--task_name TASK_NAME] [--schema_dir SCHEMA_DIR]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                 enable the train task: "true" | "false", default is "true"
    --do_eval                  enable the eval task: "true" | "false", default is "true"
    --td_phase1_epoch_size     epoch size for td phase 1: N, default is 10
    --td_phase2_epoch_size     epoch size for td phase 2: N, default is 3
    --device_id                device id: N, default is 0
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --save_ckpt_step           steps between saving checkpoint files: N, default is 1000
    --max_ckpt_num             max number of checkpoint files to keep: N, default is 1
    --data_sink_steps          set data sink steps: N, default is 1
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --load_gd_ckpt_path        path to load the checkpoint produced by general distill: PATH, default is ""
    --load_td1_ckpt_path       path to load the checkpoint produced by task distill phase 1: PATH, default is ""
    --train_data_dir           path to train dataset directory: PATH, default is ""
    --eval_data_dir            path to eval dataset directory: PATH, default is ""
    --task_name                classification task: "SST-2" | "QNLI" | "MNLI", default is ""
    --schema_dir               path to schema.json file: PATH, default is ""
```

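Again as a sketch only (placeholder paths; run_standalone_td_ascend.sh normally wires these arguments up for you), a combined train-and-eval task-distill run on SST-2 might look like:

```bash
python run_task_distill.py \
  --device_target="Ascend" \
  --do_train="true" \
  --do_eval="true" \
  --task_name="SST-2" \
  --load_teacher_ckpt_path=/path/to/sst2_finetuned_teacher.ckpt \
  --load_gd_ckpt_path=/path/to/general_distill.ckpt \
  --train_data_dir=/path/to/sst2/train_tfrecord/ \
  --eval_data_dir=/path/to/sst2/eval_tfrecord/ \
  --schema_dir=/path/to/schema.json
```
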
## Options and Parameters

`gd_config.py` and `td_config.py` contain the parameters of the BERT model and the options for the optimizer and loss scale.

### Options:

```
Parameters for loss scale:
    loss_scale_value           initial value of loss scale: N, default is 2^8
    scale_factor               factor used to update loss scale: N, default is 2
    scale_window               steps between updates of loss scale: N, default is 50

Parameters for optimizer:
    learning_rate              value of learning rate: Q
    end_learning_rate          value of end learning rate: Q, must be positive
    power                      power: Q
    weight_decay               weight decay: Q
    eps                        term added to the denominator to improve numerical stability: Q
```

### Parameters:

```
Parameters for bert network:
    batch_size                    batch size of input dataset: N, default is 16
    seq_length                    length of input sequence: N, default is 128
    vocab_size                    size of the vocabulary: N, must be consistent with the dataset you use. Default is 30522
    hidden_size                   size of bert encoder layers: N
    num_hidden_layers             number of hidden layers: N
    num_attention_heads           number of attention heads: N, default is 12
    intermediate_size             size of intermediate layer: N
    hidden_act                    activation function used: ACTIVATION, default is "gelu"
    hidden_dropout_prob           dropout probability for BertOutput: Q
    attention_probs_dropout_prob  dropout probability for BertAttention: Q
    max_position_embeddings       maximum length of sequences: N, default is 512
    save_ckpt_step                steps between saving checkpoints: N, default is 100
    max_ckpt_num                  maximum number of checkpoints to keep: N, default is 1
    type_vocab_size               size of token type vocab: N, default is 2
    initializer_range             initialization value of TruncatedNormal: Q, default is 0.02
    use_relative_positions        use relative positions or not: True | False, default is False
    input_mask_from_dataset       use the input mask loaded from dataset or not: True | False, default is True
    token_type_ids_from_dataset   use the token type ids loaded from dataset or not: True | False, default is True
    dtype                         data type of input: mstype.float16 | mstype.float32, default is mstype.float32
    compute_type                  compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16
    enable_fused_layernorm        use fused layernorm (implemented via batchnorm) to improve performance: True | False, default is False
```

## [Training Process](#contents)

### Training

#### running on Ascend

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set each path to an absolute full path, e.g. "/username/checkpoint_100_300.ckpt".

```
bash scripts/run_standalone_gd_ascend.sh
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will look as follows:

```
# grep "epoch" log.txt
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 28.2093), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 2, step: 200, outpus are (Tensor(shape=[1], dtype=Float32, 30.1724), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

### Distributed Training

#### running on Ascend

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set each path to an absolute full path, e.g. "/username/checkpoint_100_300.ckpt".

```
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the LOG* folders by default. The loss values will look as follows:

```
# grep "epoch" LOG*/log.txt
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 28.1478), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 30.5901), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

## [Evaluation Process](#contents)

### Evaluation

If you want to evaluate right after training, please set `do_train=true` and `do_eval=true`. If you want to run evaluation alone, please set `do_train=false` and `do_eval=true`.

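For illustration only, an eval-only run might be invoked directly as below; it is an assumption here that `--load_td1_ckpt_path` is the argument used to point at a previously produced task-distilled checkpoint (check run_task_distill.py), and all paths are placeholders:

```bash
python run_task_distill.py \
  --do_train="false" \
  --do_eval="true" \
  --task_name="SST-2" \
  --load_td1_ckpt_path=/path/to/task_distill.ckpt \
  --eval_data_dir=/path/to/sst2/eval_tfrecord/ \
  --schema_dir=/path/to/schema.json
```
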
#### evaluation on SST-2 dataset when running on Ascend

```
bash scripts/run_standalone_td_ascend.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will look as follows:

```bash
# grep "The best acc" log.txt
The best acc is 0.872685
The best acc is 0.893515
The best acc is 0.899305
...
The best acc is 0.902777
...
```

#### evaluation on MNLI dataset when running on Ascend

Before running the command below, please check that the path of the pre-trained checkpoint to load has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".

```
bash scripts/run_standalone_td_ascend.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will look as follows:

```
# grep "The best acc" log.txt
The best acc is 0.803206
The best acc is 0.803308
The best acc is 0.810355
...
The best acc is 0.813929
...
```

#### evaluation on QNLI dataset when running on Ascend

Before running the command below, please check that the path of the pre-trained checkpoint to load has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".

```
bash scripts/run_standalone_td_ascend.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will look as follows:

```
# grep "The best acc" log.txt
The best acc is 0.870772
The best acc is 0.871691
The best acc is 0.875183
...
The best acc is 0.891176
...
```

## [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                  | TinyBERT                                     | TinyBERT                  |
| --------------------------- | -------------------------------------------- | ------------------------- |
| Model Version               |                                              |                           |
| Resource                    | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G          |
| Uploaded Date               | 08/20/2020                                   | 05/06/2020                |
| MindSpore Version           | 0.6.0                                        | 0.3.0                     |
| Dataset                     | cn-wiki-128                                  | ImageNet                  |
| Training Parameters         | src/gd_config.py                             | src/config.py             |
| Optimizer                   | AdamWeightDecay                              | AdamWeightDecay           |
| Loss Function               | SoftmaxCrossEntropy                          | SoftmaxCrossEntropy       |
| Outputs                     | probability                                  |                           |
| Loss                        | 6.541583                                     | 1.913                     |
| Speed                       | 35.4 ms/step                                 |                           |
| Total time                  | 17.3 h                                       |                           |
| Params (M)                  | 15M                                          |                           |
| Checkpoint for task distill | 74M (.ckpt file)                             |                           |

#### Inference Performance

| Parameters                 |                               |                           |                      |
| -------------------------- | ----------------------------- | ------------------------- | -------------------- |
| Model Version              |                               |                           |                      |
| Resource                   | Huawei 910                    | NV SMX2 V100-32G          | Huawei 310           |
| Uploaded Date              | 08/20/2020                    | 05/22/2020                |                      |
| MindSpore Version          | 0.6.0                         | 0.2.0                     | 0.2.0                |
| Dataset                    | SST-2                         | ImageNet, 1.2W            | ImageNet, 1.2W       |
| batch_size                 | 32                            | 130 (8P)                  |                      |
| Accuracy                   | 0.902777                      | ACC1[72.07%] ACC5[90.90%] |                      |
| Speed                      |                               |                           |                      |
| Total time                 |                               |                           |                      |
| Model for inference        | 74M (.ckpt file)              |                           |                      |

# [Description of Random Situation](#contents)

In run_standalone_td_ascend.sh, we set do_shuffle to shuffle the dataset.

In gd_config.py and td_config.py, we set the hidden_dropout_prob and attention_probs_dropout_prob to drop out some network nodes.

In run_general_distill.py, we set the random seed to make sure distributed training starts from the same initial weights.

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).