# Contents

- [Contents](#contents)
- [TinyBERT Description](#tinybert-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
        - [General Distill](#general-distill)
        - [Task Distill](#task-distill)
    - [Options and Parameters](#options-and-parameters)
        - [Options:](#options)
        - [Parameters:](#parameters)
    - [Training Process](#training-process)
        - [Training](#training)
            - [running on Ascend](#running-on-ascend)
            - [running on GPU](#running-on-gpu)
        - [Distributed Training](#distributed-training)
            - [running on Ascend](#running-on-ascend-1)
            - [running on GPU](#running-on-gpu-1)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
            - [evaluation on SST-2 dataset](#evaluation-on-sst-2-dataset)
            - [evaluation on MNLI dataset](#evaluation-on-mnli-dataset)
            - [evaluation on QNLI dataset](#evaluation-on-qnli-dataset)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [TinyBERT Description](#contents)

[TinyBERT](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) is 7.5x smaller and 9.4x faster on inference than [BERT-base](https://github.com/google-research/bert) (the base version of the BERT model) and achieves competitive performance on natural language understanding tasks. It performs a novel transformer distillation at both the pre-training and the task-specific learning stages.

[Paper](https://arxiv.org/abs/1909.10351): Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. [TinyBERT: Distilling BERT for Natural Language Understanding](https://arxiv.org/abs/1909.10351). arXiv preprint arXiv:1909.10351.
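As a rough orientation, the transformer distillation combines layer-wise losses (embedding, hidden states, attention matrices) with a soft cross-entropy on the prediction layer. The NumPy sketch below only illustrates that idea; the tensor names, the projection matrix `proj` and the `temperature` value are illustrative assumptions, not the code in this repository.

```python
import numpy as np

def mse(a, b):
    # mean squared error between matched student/teacher tensors
    return np.mean((a - b) ** 2)

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    # the student learns the teacher's softened output distribution
    t = np.exp(teacher_logits / temperature)
    t /= t.sum(axis=-1, keepdims=True)
    s = student_logits / temperature
    log_s = s - np.log(np.sum(np.exp(s), axis=-1, keepdims=True))
    return -np.mean(np.sum(t * log_s, axis=-1))

def transformer_distill_loss(student, teacher, proj):
    # student/teacher: dicts with "embedding", "hidden" (list), "attention" (list), "logits"
    # proj: linear map lifting the student's hidden size to the teacher's hidden size
    loss = mse(student["embedding"] @ proj, teacher["embedding"])
    for s_h, t_h in zip(student["hidden"], teacher["hidden"]):
        loss += mse(s_h @ proj, t_h)          # hidden-state distillation
    for s_a, t_a in zip(student["attention"], teacher["attention"]):
        loss += mse(s_a, t_a)                 # attention-matrix distillation
    loss += soft_cross_entropy(student["logits"], teacher["logits"])
    return loss
```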

# [Model Architecture](#contents)

The backbone of TinyBERT is the transformer. The transformer contains four encoder modules; each encoder contains a self-attention module, and each self-attention module contains an attention module.

# [Dataset](#contents)

- Download the zhwiki or enwiki dataset for general distillation. Extract and clean the text in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). To convert the dataset to TFRecord format, please refer to create_pretraining_data.py in the [BERT](https://github.com/google-research/bert) repository.
- Download the GLUE dataset for task distillation. To convert the dataset files from JSON format to TFRecord format, please refer to run_classifier.py in the [BERT](https://github.com/google-research/bert) repository.

# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare the hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Quick Start](#contents)

After installing MindSpore via the official website, you can start general distill, task distill and evaluation as follows:

```text
# run standalone general distill example
bash scripts/run_standalone_gd.sh

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_standalone_gd.sh file first. If running on GPU, please set `device_target=GPU`.

# For Ascend devices, run the distributed general distill example
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_distributed_gd_ascend.sh file first.

# For GPU devices, run the distributed general distill example
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt

# run task distill and evaluation example
bash scripts/run_standalone_td.sh

Before running the shell script, please set the `task_name`, `load_teacher_ckpt_path`, `load_gd_ckpt_path`, `train_data_dir`, `eval_data_dir`, `schema_dir` and `dataset_type` in the run_standalone_td.sh file first.
If running on GPU, please set `device_target=GPU`.
```

For distributed training on Ascend, an hccl configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.

For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/doc/programming_guide/en/master/dataset_loading.html#tfrecord) format.

```text
For the general distill task, the schema file contains ["input_ids", "input_mask", "segment_ids"].
For the task distill and eval phases, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].

`numRows` is the only option which can be set by the user; the other values must be set according to the dataset.

For example, if the dataset is cn-wiki-128, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```
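As an illustration (not part of the repository scripts), a schema file like the one above can be consumed by MindSpore's TFRecord loader roughly as follows; the file paths are placeholders.

```python
import mindspore.dataset as ds

# Placeholder paths; point these at your converted TFRecord files and schema.
data_files = ["/path/data/wiki_00.tfrecord"]
schema_file = "/path/schema.json"

dataset = ds.TFRecordDataset(
    data_files,
    schema=schema_file,                                   # column types/shapes come from the JSON schema
    columns_list=["input_ids", "input_mask", "segment_ids"],
    shuffle=True,
)
dataset = dataset.batch(16, drop_remainder=True)          # 16 matches the default batch_size described below

for batch in dataset.create_dict_iterator(num_epochs=1):
    print(batch["input_ids"].shape)
    break
```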

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└─bert
  ├─README.md
  ├─scripts
    ├─run_distributed_gd_ascend.sh       # shell script for distributed general distill phase on Ascend
    ├─run_distributed_gd_gpu.sh          # shell script for distributed general distill phase on GPU
    ├─run_standalone_gd.sh               # shell script for standalone general distill phase
    ├─run_standalone_td.sh               # shell script for standalone task distill phase
  ├─src
    ├─__init__.py
    ├─assessment_method.py               # assessment method for evaluation
    ├─dataset.py                         # data processing
    ├─gd_config.py                       # parameter configuration for general distill phase
    ├─td_config.py                       # parameter configuration for task distill phase
    ├─tinybert_for_gd_td.py              # backbone code of network
    ├─tinybert_model.py                  # backbone code of network
    ├─utils.py                           # util function
  ├─__init__.py
  ├─run_general_distill.py               # train net for general distillation
  ├─run_task_distill.py                  # train and eval net for task distillation
```

## [Script Parameters](#contents)

### General Distill

```text
usage: run_general_distill.py  [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                               [--device_target DEVICE_TARGET] [--do_shuffle DO_SHUFFLE]
                               [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                               [--save_ckpt_path SAVE_CKPT_PATH]
                               [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                               [--save_checkpoint_step N] [--max_ckpt_num N]
                               [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR] [--dataset_type DATASET_TYPE]
                               [--train_steps N]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --distribute               pre-training on several devices: "true" (training on more than one device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 1
    --device_id                device id: N, default is 0
    --device_num               number of used devices: N, default is 1
    --save_ckpt_path           path to save checkpoint files: PATH, default is ""
    --max_ckpt_num             max number of saved checkpoint files: N, default is 1
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --data_sink_steps          set data sink steps: N, default is 1
    --save_checkpoint_step     steps for saving checkpoint files: N, default is 1000
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --data_dir                 path to dataset directory: PATH, default is ""
    --schema_dir               path to schema.json file: PATH, default is ""
    --dataset_type             the dataset type, which can be tfrecord or mindrecord, default is tfrecord
```
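For reference, `enable_data_sink` and `data_sink_steps` correspond to MindSpore's dataset sink mode. The sketch below only illustrates how such flags typically feed into `Model.train`; the network and dataset objects are placeholders, not the repository's training setup.

```python
from mindspore import Model

def run_training(network, dataset, epoch_size, enable_data_sink="true", data_sink_steps=1):
    model = Model(network)
    if enable_data_sink == "true":
        # In sink mode the device pulls `data_sink_steps` batches per sink iteration,
        # so the epoch count is rescaled to keep the total number of steps unchanged.
        sink_epochs = epoch_size * dataset.get_dataset_size() // data_sink_steps
        model.train(sink_epochs, dataset, dataset_sink_mode=True, sink_size=data_sink_steps)
    else:
        model.train(epoch_size, dataset, dataset_sink_mode=False)
```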

### Task Distill

```text
usage: run_task_distill.py  [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
                            [--td_phase1_epoch_size N] [--td_phase2_epoch_size N]
                            [--device_id N] [--do_shuffle DO_SHUFFLE]
                            [--enable_data_sink ENABLE_DATA_SINK] [--save_ckpt_step N]
                            [--max_ckpt_num N] [--data_sink_steps N]
                            [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                            [--load_gd_ckpt_path LOAD_GD_CKPT_PATH]
                            [--load_td1_ckpt_path LOAD_TD1_CKPT_PATH]
                            [--train_data_dir TRAIN_DATA_DIR]
                            [--eval_data_dir EVAL_DATA_DIR]
                            [--task_name TASK_NAME] [--schema_dir SCHEMA_DIR] [--dataset_type DATASET_TYPE]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                 enable the train task: "true" | "false", default is "true"
    --do_eval                  enable the eval task: "true" | "false", default is "true"
    --td_phase1_epoch_size     epoch size for td phase 1: N, default is 10
    --td_phase2_epoch_size     epoch size for td phase 2: N, default is 3
    --device_id                device id: N, default is 0
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --save_ckpt_step           steps for saving checkpoint files: N, default is 1000
    --max_ckpt_num             max number of saved checkpoint files: N, default is 1
    --data_sink_steps          set data sink steps: N, default is 1
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --load_gd_ckpt_path        path to load checkpoint files produced by general distill: PATH, default is ""
    --load_td1_ckpt_path       path to load checkpoint files produced by task distill phase 1: PATH, default is ""
    --train_data_dir           path to the train dataset directory: PATH, default is ""
    --eval_data_dir            path to the eval dataset directory: PATH, default is ""
    --task_name                classification task: "SST-2" | "QNLI" | "MNLI", default is ""
    --schema_dir               path to schema.json file: PATH, default is ""
    --dataset_type             the dataset type, which can be tfrecord or mindrecord, default is tfrecord
```

## Options and Parameters

`gd_config.py` and `td_config.py` contain the parameters of the BERT model and the options for the optimizer and loss scale.

### Options

```text
batch_size                          batch size of input dataset: N, default is 16
Parameters for loss scale:
    loss_scale_value                initial value of loss scale: N, default is 2^8
    scale_factor                    factor used to update loss scale: N, default is 2
    scale_window                    steps for one update of loss scale: N, default is 50
Parameters for optimizer:
    learning_rate                   value of learning rate: Q
    end_learning_rate               value of end learning rate: Q, must be positive
    power                           power: Q
    weight_decay                    weight decay: Q
    eps                             term added to the denominator to improve numerical stability: Q
```
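The three loss-scale options implement the usual dynamic loss-scaling rule: divide the scale by `scale_factor` when an overflow occurs, and multiply it by `scale_factor` after `scale_window` consecutive overflow-free steps. Below is a minimal standalone sketch of that rule (MindSpore ships its own version of this logic), not the repository's implementation.

```python
class DynamicLossScaler:
    """Standalone sketch of dynamic loss scaling (defaults mirror the options above)."""

    def __init__(self, loss_scale_value=2 ** 8, scale_factor=2, scale_window=50):
        self.scale = float(loss_scale_value)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0          # consecutive steps without overflow

    def update(self, overflow):
        if overflow:
            # Gradients overflowed in float16: shrink the scale (the step is usually skipped).
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                # Stable for a full window: try a larger scale for better gradient precision.
                self.scale *= self.scale_factor
                self.good_steps = 0
        return self.scale


scaler = DynamicLossScaler()
for overflow in [False] * 50 + [True]:
    scaler.update(overflow)
print(scaler.scale)  # 256 -> 512 after 50 clean steps -> 256 again after the overflow
```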

### Parameters

```text
Parameters for bert network:
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of each embedding vector: N, must be consistent with the dataset you use. Default is 30522
    hidden_size                     size of bert encoder layers: N
    num_hidden_layers               number of hidden layers: N
    num_attention_heads             number of attention heads: N, default is 12
    intermediate_size               size of intermediate layer: N
    hidden_act                      activation function used: ACTIVATION, default is "gelu"
    hidden_dropout_prob             dropout probability for BertOutput: Q
    attention_probs_dropout_prob    dropout probability for BertAttention: Q
    max_position_embeddings         maximum length of sequences: N, default is 512
    save_ckpt_step                  step interval for saving checkpoints: N, default is 100
    max_ckpt_num                    maximum number of saved checkpoints: N, default is 1
    type_vocab_size                 size of token type vocab: N, default is 2
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    use_relative_positions          use relative positions or not: True | False, default is False
    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16
```
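For orientation, a student configuration with these fields might look like the dataclass sketch below. The 4-layer/312-hidden values follow the TinyBERT student described in the paper and are illustrative only; the real defaults live in `src/gd_config.py` and `src/td_config.py`.

```python
from dataclasses import dataclass
import mindspore.common.dtype as mstype

@dataclass
class StudentBertConfig:
    # Field names mirror the parameter list above; values are illustrative.
    seq_length: int = 128
    vocab_size: int = 30522
    hidden_size: int = 312
    num_hidden_layers: int = 4
    num_attention_heads: int = 12
    intermediate_size: int = 1200
    hidden_act: str = "gelu"
    hidden_dropout_prob: float = 0.1
    attention_probs_dropout_prob: float = 0.1
    max_position_embeddings: int = 512
    type_vocab_size: int = 2
    initializer_range: float = 0.02
    use_relative_positions: bool = False

# dtype/compute_type are MindSpore type objects rather than plain Python values.
config = StudentBertConfig()
dtype, compute_type = mstype.float32, mstype.float16
print(config.hidden_size, config.num_hidden_layers)
```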

## [Training Process](#contents)

### Training

#### running on Ascend

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set the paths to absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_gd.sh
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will look like the following:

```text
# grep "epoch" log.txt
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 28.2093), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 2, step: 200, outpus are (Tensor(shape=[1], dtype=Float32, 30.1724), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

> **Attention** This will bind the processor cores according to `device_num` and the total number of processor cores. If you don't want to run pretraining with bound processor cores, remove the `taskset` operations in `scripts/run_distributed_gd_ascend.sh`.

#### running on GPU

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `device_target=GPU` have been set. Please set the paths to absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_gd.sh
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will look like the following:

```text
# grep "epoch" log.txt
epoch: 1, step: 100, outpus are 28.2093
...
```

### Distributed Training

#### running on Ascend

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set the paths to absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the LOG* folders by default. The loss values will look like the following:

```text
# grep "epoch" LOG*/log.txt
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 28.1478), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 1, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, 30.5901), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

#### running on GPU

Please pass the paths as absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the LOG* folders by default. The loss values will look like the following:

```text
# grep "epoch" LOG*/log.txt
epoch: 1, step: 1, outpus are 63.4098
...
```

## [Evaluation Process](#contents)

### Evaluation

If you want to train and then continue with evaluation, please set `do_train=true` and `do_eval=true`. If you want to run evaluation alone, please set `do_train=false` and `do_eval=true`. If running on GPU, please set `device_target=GPU`.
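The accuracy reported below is a plain argmax-match metric of the kind implemented in `src/assessment_method.py`. As a rough illustration (not the repository's exact class), it amounts to:

```python
import numpy as np

class Accuracy:
    """Running accuracy over evaluation batches (illustrative sketch)."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, logits, labels):
        # logits: [batch, num_labels] scores; labels: [batch] gold class ids
        preds = np.argmax(logits, axis=-1)
        self.correct += int(np.sum(preds == labels.reshape(-1)))
        self.total += labels.size

    @property
    def acc(self):
        return self.correct / max(self.total, 1)


metric = Accuracy()
metric.update(np.array([[0.1, 0.9], [0.8, 0.2]]), np.array([1, 0]))
print(f"The best acc is {metric.acc:.6f}")  # mimics the log lines grepped below
```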

#### evaluation on SST-2 dataset

```bash
bash scripts/run_standalone_td.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:

```text
# grep "The best acc" log.txt
The best acc is 0.872685
The best acc is 0.893515
The best acc is 0.899305
...
The best acc is 0.902777
...
```

#### evaluation on MNLI dataset

Before running the command below, please check that the pre-trained checkpoint path to load has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_td.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:

```text
# grep "The best acc" log.txt
The best acc is 0.803206
The best acc is 0.803308
The best acc is 0.810355
...
The best acc is 0.813929
...
```

#### evaluation on QNLI dataset

Before running the command below, please check that the pre-trained checkpoint path to load has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_td.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:

```text
# grep "The best acc" log.txt
The best acc is 0.870772
The best acc is 0.871691
The best acc is 0.875183
...
The best acc is 0.891176
...
```

## [Model Description](#contents)

## [Performance](#contents)

### training Performance

| Parameters                  | Ascend                                                     | GPU                                                     |
| --------------------------- | ---------------------------------------------------------- | ------------------------------------------------------- |
| Model Version               | TinyBERT                                                   | TinyBERT                                                |
| Resource                    | Ascend 910; CPU 2.60GHz, 192 cores; memory 755 GB          | NV SMX2 V100-32G; CPU 2.10GHz, 64 cores; memory 251 GB  |
| Uploaded Date               | 08/20/2020                                                 | 08/24/2020                                              |
| MindSpore Version           | 1.0.0                                                      | 1.0.0                                                   |
| Dataset                     | cn-wiki-128                                                | cn-wiki-128                                             |
| Training Parameters         | src/gd_config.py                                           | src/gd_config.py                                        |
| Optimizer                   | AdamWeightDecay                                            | AdamWeightDecay                                         |
| Loss Function               | SoftmaxCrossEntropy                                        | SoftmaxCrossEntropy                                     |
| Outputs                     | probability                                                | probability                                             |
| Loss                        | 6.541583                                                   | 6.6915                                                  |
| Speed                       | 35.4 ms/step                                               | 98.654 ms/step                                          |
| Total time                  | 17.3 h (3 epochs, 8p)                                      | 48 h (3 epochs, 8p)                                     |
| Params (M)                  | 15M                                                        | 15M                                                     |
| Checkpoint for task distill | 74M (.ckpt file)                                           | 74M (.ckpt file)                                        |
| Scripts                     | [TinyBERT](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/tinybert) |            |

#### Inference Performance

| Parameters          | Ascend            | GPU               |
| ------------------- | ----------------- | ----------------- |
| Model Version       |                   |                   |
| Resource            | Ascend 910        | NV SMX2 V100-32G  |
| Uploaded Date       | 08/20/2020        | 08/24/2020        |
| MindSpore Version   | 1.0.0             | 1.0.0             |
| Dataset             | SST-2             | SST-2             |
| batch_size          | 32                | 32                |
| Accuracy            | 0.902777          | 0.9086            |
| Speed               |                   |                   |
| Total time          |                   |                   |
| Model for inference | 74M (.ckpt file)  | 74M (.ckpt file)  |

# [Description of Random Situation](#contents)

In run_standalone_td.sh, we set do_shuffle to shuffle the dataset.

In gd_config.py and td_config.py, we set hidden_dropout_prob and attention_probs_dropout_prob to drop out some network nodes.

In run_general_distill.py, we set the random seed to make sure distributed training starts from the same initial weights.
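As an illustration of how such a fixed seed is usually set in MindSpore scripts (the exact seeding code lives in run_general_distill.py; the seed value below is a placeholder):

```python
import numpy as np
from mindspore.common import set_seed

# Illustrative only: fix the global seeds so every device initializes the
# student with the same weights before distributed general distillation.
set_seed(123)
np.random.seed(123)
```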

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).