# Contents

- [Contents](#contents)
- [TinyBERT Description](#tinybert-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
        - [General Distill](#general-distill)
        - [Task Distill](#task-distill)
    - [Options and Parameters](#options-and-parameters)
        - [Options:](#options)
        - [Parameters:](#parameters)
    - [Training Process](#training-process)
        - [Training](#training)
            - [running on Ascend](#running-on-ascend)
            - [running on GPU](#running-on-gpu)
        - [Distributed Training](#distributed-training)
            - [running on Ascend](#running-on-ascend-1)
            - [running on GPU](#running-on-gpu-1)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
            - [evaluation on SST-2 dataset](#evaluation-on-sst-2-dataset)
            - [evaluation on MNLI dataset](#evaluation-on-mnli-dataset)
            - [evaluation on QNLI dataset](#evaluation-on-qnli-dataset)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [TinyBERT Description](#contents)

[TinyBERT](https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/TinyBERT) is 7.5x smaller and 9.4x faster on inference than [BERT-base](https://github.com/google-research/bert) (the base version of the BERT model) and achieves competitive performance on natural language understanding tasks. It performs a novel transformer distillation at both the pre-training and task-specific learning stages.

[Paper](https://arxiv.org/abs/1909.10351): Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, Qun Liu. [TinyBERT: Distilling BERT for Natural Language Understanding](https://arxiv.org/abs/1909.10351). arXiv preprint arXiv:1909.10351.

# [Model Architecture](#contents)

The backbone of TinyBERT is the transformer: the transformer contains four encoder modules, each encoder contains a self-attention module, and each self-attention module contains an attention module.

# [Dataset](#contents)

- Create the dataset for the general distill phase
    - Download the [zhwiki](https://dumps.wikimedia.org/zhwiki/) or [enwiki](https://dumps.wikimedia.org/enwiki/) dump for pre-training.
    - Extract and refine the texts in the dump with [WikiExtractor](https://github.com/attardi/wikiextractor). The commands are as follows:
        - `pip install wikiextractor`
        - `python -m wikiextractor.WikiExtractor -o <output file path> -b <output file size> <Wikipedia dump file>`
    - Convert the dataset to TFRecord format (see the sketch after this list). Please refer to the create_pretraining_data.py file in the [BERT](https://github.com/google-research/bert) repository and download its vocab.txt. If `AttributeError: module 'tokenization' has no attribute 'FullTokenizer'` occurs, please install bert-tensorflow.
- Create the dataset for the task distill phase
    - Download the [GLUE](https://github.com/nyu-mll/GLUE-baselines) dataset for the task distill phase.
    - Convert the dataset files to TFRecord format. Please refer to the run_classifier.py file in the [BERT](https://github.com/google-research/bert) repository.

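For reference, a minimal sketch of the TFRecord conversion for the general distill phase, assuming the [BERT](https://github.com/google-research/bert) repository has been cloned and bert-tensorflow installed; all paths are placeholders and the hyper-parameter values are illustrative only:

```bash
# Hedged sketch: convert the extracted wiki text to TFRecord with the script from
# the google-research/bert repository. All paths and hyper-parameters below are placeholders.
pip install bert-tensorflow
python create_pretraining_data.py \
    --input_file=/path/extracted/wiki_00 \
    --output_file=/path/tfrecord/wiki_00.tfrecord \
    --vocab_file=/path/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --random_seed=12345 \
    --dupe_factor=5
```
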
# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare the hardware environment with an Ascend or GPU processor.
- Framework
    - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Quick Start](#contents)

After installing MindSpore via the official website, you can start general distill, task distill and evaluation as follows:

```text
# run standalone general distill example
bash scripts/run_standalone_gd.sh

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_standalone_gd.sh file first. If running on GPU, please set the `device_target=GPU`.

# For Ascend device, run distributed general distill example
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json

Before running the shell script, please set the `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `dataset_type` in the run_distributed_gd_ascend.sh file first.

# For GPU device, run distributed general distill example
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt

# run task distill and evaluation example
bash scripts/run_standalone_td.sh

Before running the shell script, please set the `task_name`, `load_teacher_ckpt_path`, `load_gd_ckpt_path`, `train_data_dir`, `eval_data_dir`, `schema_dir` and `dataset_type` in the run_standalone_td.sh file first.
If running on GPU, please set the `device_target=GPU`.
```

For distributed training on Ascend, an hccl configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
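For example, a hedged sketch of generating the file with the `hccl_tools.py` script from that directory; the `--device_num` range below is an assumption for an 8-device host:

```bash
# Hedged sketch: generate hccl.json for local devices 0-7 (run on the Ascend server).
# The resulting file is then passed to scripts/run_distributed_gd_ascend.sh as /path/hccl.json.
python hccl_tools.py --device_num "[0,8)"
```
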
For the dataset, if you want to set the format and parameters, a schema configuration file in JSON format needs to be created; please refer to the [tfrecord](https://www.mindspore.cn/doc/programming_guide/en/master/dataset_loading.html#tfrecord) format.

```text
For the general distill phase, the schema file contains ["input_ids", "input_mask", "segment_ids"].
For the task distill and eval phase, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].

`numRows` is the only option which can be set by the user; the other values must be set according to the dataset.

For example, if the dataset is cn-wiki-128, the schema file for the general distill phase is as follows:
{
    "datasetType": "TF",
    "numRows": 7680,
    "columns": {
        "input_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "input_mask": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        },
        "segment_ids": {
            "type": "int64",
            "rank": 1,
            "shape": [256]
        }
    }
}
```

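For the task distill and eval phase, the schema additionally needs a `label_ids` column. A hedged sketch (written as a shell here-document) is shown below; `numRows`, the sequence length and the `[1]` shape of `label_ids` are assumptions for a classification task and must be adjusted to your dataset:

```bash
# Hedged sketch: schema for the task distill / eval phase (classification).
# numRows and the shapes are placeholders; set them to match your TFRecord files.
cat > /path/schema_td.json << 'EOF'
{
    "datasetType": "TF",
    "numRows": 10000,
    "columns": {
        "input_ids":   {"type": "int64", "rank": 1, "shape": [128]},
        "input_mask":  {"type": "int64", "rank": 1, "shape": [128]},
        "segment_ids": {"type": "int64", "rank": 1, "shape": [128]},
        "label_ids":   {"type": "int64", "rank": 1, "shape": [1]}
    }
}
EOF
```
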
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└─tinybert
  ├─README.md
  ├─scripts
    ├─run_distributed_gd_ascend.sh       # shell script for distributed general distill phase on Ascend
    ├─run_distributed_gd_gpu.sh          # shell script for distributed general distill phase on GPU
    ├─run_standalone_gd.sh               # shell script for standalone general distill phase
    ├─run_standalone_td.sh               # shell script for standalone task distill phase
  ├─src
    ├─__init__.py
    ├─assessment_method.py               # assessment method for evaluation
    ├─dataset.py                         # data processing
    ├─gd_config.py                       # parameter configuration for general distill phase
    ├─td_config.py                       # parameter configuration for task distill phase
    ├─tinybert_for_gd_td.py              # backbone code of network
    ├─tinybert_model.py                  # backbone code of network
    ├─utils.py                           # util function
  ├─__init__.py
  ├─run_general_distill.py               # train net for general distillation
  ├─run_task_distill.py                  # train and eval net for task distillation
```

## [Script Parameters](#contents)

### General Distill

```text
usage: run_general_distill.py   [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                                [--device_target DEVICE_TARGET] [--do_shuffle DO_SHUFFLE]
                                [--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
                                [--save_ckpt_path SAVE_CKPT_PATH]
                                [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                                [--save_checkpoint_step N] [--max_ckpt_num N]
                                [--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR] [--dataset_type DATASET_TYPE]
                                [--train_steps N]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --distribute               pre-training by several devices: "true" (training by more than one device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 1
    --device_id                device id: N, default is 0
    --device_num               number of used devices: N, default is 1
    --save_ckpt_path           path to save checkpoint files: PATH, default is ""
    --max_ckpt_num             max number of saved checkpoint files: N, default is 1
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --data_sink_steps          set data sink steps: N, default is 1
    --save_checkpoint_step     steps between saving checkpoint files: N, default is 1000
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --data_dir                 path to dataset directory: PATH, default is ""
    --schema_dir               path to schema.json file: PATH, default is ""
    --dataset_type             dataset type, can be tfrecord or mindrecord, default is tfrecord
```

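For reference, a hedged example of invoking the script directly with a subset of the options above; all paths are placeholders, and `scripts/run_standalone_gd.sh` remains the recommended entry point:

```bash
# Hedged sketch: standalone general distill on a single Ascend device.
# All paths below are placeholders.
python run_general_distill.py \
    --distribute="false" \
    --device_target="Ascend" \
    --device_id=0 \
    --epoch_size=3 \
    --save_ckpt_path=/path/save_ckpt \
    --load_teacher_ckpt_path=/path/teacher.ckpt \
    --data_dir=/path/tfrecord_dir \
    --schema_dir=/path/schema.json \
    --dataset_type="tfrecord"
```
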
### Task Distill

```text
usage: run_task_distill.py  [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
                            [--td_phase1_epoch_size N] [--td_phase2_epoch_size N]
                            [--device_id N] [--do_shuffle DO_SHUFFLE]
                            [--enable_data_sink ENABLE_DATA_SINK] [--save_ckpt_step N]
                            [--max_ckpt_num N] [--data_sink_steps N]
                            [--load_teacher_ckpt_path LOAD_TEACHER_CKPT_PATH]
                            [--load_gd_ckpt_path LOAD_GD_CKPT_PATH]
                            [--load_td1_ckpt_path LOAD_TD1_CKPT_PATH]
                            [--train_data_dir TRAIN_DATA_DIR]
                            [--eval_data_dir EVAL_DATA_DIR] [--task_type TASK_TYPE]
                            [--task_name TASK_NAME] [--schema_dir SCHEMA_DIR] [--dataset_type DATASET_TYPE]
                            [--assessment_method ASSESSMENT_METHOD]

options:
    --device_target            device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
    --do_train                 enable train task: "true" | "false", default is "true"
    --do_eval                  enable eval task: "true" | "false", default is "true"
    --td_phase1_epoch_size     epoch size for td phase 1: N, default is 10
    --td_phase2_epoch_size     epoch size for td phase 2: N, default is 3
    --device_id                device id: N, default is 0
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "true"
    --save_ckpt_step           steps between saving checkpoint files: N, default is 1000
    --max_ckpt_num             max number of saved checkpoint files: N, default is 1
    --data_sink_steps          set data sink steps: N, default is 1
    --load_teacher_ckpt_path   path to load teacher checkpoint files: PATH, default is ""
    --load_gd_ckpt_path        path to load the checkpoint file produced by general distill: PATH, default is ""
    --load_td1_ckpt_path       path to load the checkpoint file produced by task distill phase 1: PATH, default is ""
    --train_data_dir           path to train dataset directory: PATH, default is ""
    --eval_data_dir            path to eval dataset directory: PATH, default is ""
    --task_type                task type: "classification" | "ner", default is "classification"
    --task_name                classification or ner task: "SST-2" | "QNLI" | "MNLI" | "TNEWS" | "CLUENER", default is ""
    --assessment_method        assessment method used for evaluation: acc | f1
    --schema_dir               path to schema.json file: PATH, default is ""
    --dataset_type             dataset type, can be tfrecord or mindrecord, default is tfrecord
```

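Similarly, a hedged example of invoking the task distill script directly for SST-2; all paths are placeholders, and `scripts/run_standalone_td.sh` remains the recommended entry point:

```bash
# Hedged sketch: task distill (phase 1 + phase 2) and evaluation on SST-2, single Ascend device.
# All paths below are placeholders.
python run_task_distill.py \
    --device_target="Ascend" \
    --do_train="true" \
    --do_eval="true" \
    --task_type="classification" \
    --task_name="SST-2" \
    --load_teacher_ckpt_path=/path/sst2_finetuned_teacher.ckpt \
    --load_gd_ckpt_path=/path/general_distill.ckpt \
    --train_data_dir=/path/sst2/train_tfrecord_dir \
    --eval_data_dir=/path/sst2/eval_tfrecord_dir \
    --schema_dir=/path/schema.json \
    --dataset_type="tfrecord" \
    --assessment_method="acc"
```
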
## Options and Parameters

`gd_config.py` and `td_config.py` contain parameters of the BERT model and options for the optimizer and loss scale.

### Options

```text
batch_size                          batch size of input dataset: N, default is 16
Parameters for loss scale:
    loss_scale_value                initial value of loss scale: N, default is 2^8
    scale_factor                    factor used to update loss scale: N, default is 2
    scale_window                    steps for one update of loss scale: N, default is 50
Parameters for optimizer:
    learning_rate                   value of learning rate: Q
    end_learning_rate               value of end learning rate: Q, must be positive
    power                           power: Q
    weight_decay                    weight decay: Q
    eps                             term added to the denominator to improve numerical stability: Q
```

### Parameters

```text
Parameters for bert network:
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of the vocabulary: N, must be consistent with the dataset you use. Default is 30522.
                                    Usually, we use 21128 for CN vocabs and 30522 for EN vocabs according to the original paper.
    hidden_size                     size of bert encoder layers: N
    num_hidden_layers               number of hidden layers: N
    num_attention_heads             number of attention heads: N, default is 12
    intermediate_size               size of intermediate layer: N
    hidden_act                      activation function used: ACTIVATION, default is "gelu"
    hidden_dropout_prob             dropout probability for BertOutput: Q
    attention_probs_dropout_prob    dropout probability for BertAttention: Q
    max_position_embeddings         maximum length of sequences: N, default is 512
    save_ckpt_step                  steps between saving checkpoints: N, default is 100
    max_ckpt_num                    maximum number of saved checkpoints: N, default is 1
    type_vocab_size                 size of token type vocab: N, default is 2
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    use_relative_positions          use relative positions or not: True | False, default is False
    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16
```

## [Training Process](#contents)

### Training

#### running on Ascend

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set the paths to absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_gd.sh
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will be as follows:

```text
# grep "epoch" log.txt
epoch: 1, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, 28.2093), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 2, step: 200, outputs are (Tensor(shape=[1], dtype=Float32, 30.1724), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

> **Attention** This will bind the processor cores according to `device_num` and the total number of processors. If you do not want to run pre-training with bound processor cores, remove the `taskset` related operations in `scripts/run_distributed_gd_ascend.sh`.

#### running on GPU

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir`, `schema_dir` and `device_target=GPU` have been set. Please set the paths to absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_gd.sh
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the script folder by default. The loss values will be as follows:

```text
# grep "epoch" log.txt
epoch: 1, step: 100, outputs are 28.2093
...
```

### Distributed Training

#### running on Ascend

Before running the command below, please check that `load_teacher_ckpt_path`, `data_dir` and `schema_dir` have been set. Please set the paths to absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_distributed_gd_ascend.sh 8 1 /path/hccl.json
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the LOG* folder by default. The loss values will be as follows:

```text
# grep "epoch" LOG*/log.txt
epoch: 1, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, 28.1478), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 1, step: 100, outputs are (Tensor(shape=[1], dtype=Float32, 30.5901), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
```

#### running on GPU

Please set the input paths to absolute full paths, e.g. "/username/checkpoint_100_300.ckpt".

```bash
bash scripts/run_distributed_gd_gpu.sh 8 1 /path/data/ /path/schema.json /path/teacher.ckpt
```

The command above will run in the background; you can view the results in the file log.txt. After training, you will get some checkpoint files under the LOG* folder by default. The loss values will be as follows:

```text
# grep "epoch" LOG*/log.txt
epoch: 1, step: 1, outputs are 63.4098
...
```

## [Evaluation Process](#contents)

### Evaluation

If you want to train and then continue with evaluation, please set `do_train=true` and `do_eval=true`; if you want to run evaluation alone, please set `do_train=false` and `do_eval=true`. If running on GPU, please set `device_target=GPU`.
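For example, to run evaluation alone, the two flags can be passed to `run_task_distill.py` directly; this is a hedged sketch, and the checkpoint, data and schema paths are placeholders that normally come from `run_standalone_td.sh`:

```bash
# Hedged sketch: evaluation only on SST-2, loading a checkpoint produced by task distill.
python run_task_distill.py \
    --do_train="false" \
    --do_eval="true" \
    --task_type="classification" \
    --task_name="SST-2" \
    --load_td1_ckpt_path=/path/td_checkpoint.ckpt \
    --eval_data_dir=/path/sst2/eval_tfrecord_dir \
    --schema_dir=/path/schema.json
```
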
#### evaluation on SST-2 dataset

```bash
bash scripts/run_standalone_td.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:

```bash
# grep "The best acc" log.txt
The best acc is 0.872685
The best acc is 0.893515
The best acc is 0.899305
...
The best acc is 0.902777
...
```

#### evaluation on MNLI dataset

Before running the command below, please check that the pre-trained checkpoint path to load has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_td.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:

```text
# grep "The best acc" log.txt
The best acc is 0.803206
The best acc is 0.803308
The best acc is 0.810355
...
The best acc is 0.813929
...
```

#### evaluation on QNLI dataset

Before running the command below, please check that the pre-trained checkpoint path to load has been set. Please set the checkpoint path to an absolute full path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".

```bash
bash scripts/run_standalone_td.sh
```

The command above will run in the background; you can view the results in the file log.txt. The accuracy on the test dataset will be as follows:

```text
# grep "The best acc" log.txt
The best acc is 0.870772
The best acc is 0.871691
The best acc is 0.875183
...
The best acc is 0.891176
...
```

## [Model Description](#contents)

## [Performance](#contents)

### training Performance

| Parameters                  | Ascend                                                       | GPU                                                  |
| --------------------------- | ------------------------------------------------------------ | ---------------------------------------------------- |
| Model Version               | TinyBERT                                                     | TinyBERT                                             |
| Resource                    | Ascend 910; CPU 2.60GHz, 192 cores; memory 755G; OS Euler2.8 | NV SMX2 V100-32G; CPU 2.10GHz, 64 cores; memory 251G |
| uploaded Date               | 08/20/2020                                                   | 08/24/2020                                           |
| MindSpore Version           | 1.0.0                                                        | 1.0.0                                                |
| Dataset                     | en-wiki-128                                                  | en-wiki-128                                          |
| Training Parameters         | src/gd_config.py                                             | src/gd_config.py                                     |
| Optimizer                   | AdamWeightDecay                                              | AdamWeightDecay                                      |
| Loss Function               | SoftmaxCrossEntropy                                          | SoftmaxCrossEntropy                                  |
| outputs                     | probability                                                  | probability                                          |
| Loss                        | 6.541583                                                     | 6.6915                                               |
| Speed                       | 35.4 ms/step                                                 | 98.654 ms/step                                       |
| Total time                  | 17.3 h (3 epochs, 8 devices)                                 | 48 h (3 epochs, 8 devices)                           |
| Params (M)                  | 15M                                                          | 15M                                                  |
| Checkpoint for task distill | 74M (.ckpt file)                                             | 74M (.ckpt file)                                     |
| Scripts                     | [TinyBERT](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/tinybert) |                     |

#### Inference Performance

| Parameters          | Ascend                  | GPU              |
| ------------------- | ----------------------- | ---------------- |
| Model Version       |                         |                  |
| Resource            | Ascend 910; OS Euler2.8 | NV SMX2 V100-32G |
| uploaded Date       | 08/20/2020              | 08/24/2020       |
| MindSpore Version   | 1.0.0                   | 1.0.0            |
| Dataset             | SST-2                   | SST-2            |
| batch_size          | 32                      | 32               |
| Accuracy            | 0.902777                | 0.9086           |
| Speed               |                         |                  |
| Total time          |                         |                  |
| Model for inference | 74M (.ckpt file)        | 74M (.ckpt file) |

# [Description of Random Situation](#contents)

In run_standalone_td.sh, we set do_shuffle to shuffle the dataset.

In gd_config.py and td_config.py, we set hidden_dropout_prob and attention_probs_dropout_prob to drop out some network nodes.

In run_general_distill.py, we set the random seed to make sure distributed training starts from the same initial weights.

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).