# Contents

- [Transformer Description](#transformer-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Dataset Preparation](#dataset-preparation)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [Transformer Description](#contents)

Transformer was proposed in 2017 and is designed to process sequential data. It is adopted mainly in the field of natural language processing (NLP), for tasks like machine translation and text summarization. Unlike traditional recurrent neural networks (RNNs), which process data in order, Transformer relies on an attention mechanism and achieves much better parallelism, which reduces training time and makes training on larger datasets possible. Since the Transformer model was introduced, it has been used to tackle many NLP problems and has given rise to many derived networks, such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer).

[Paper](https://arxiv.org/abs/1706.03762): Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, pages 5998–6008.
# [Model Architecture](#contents)

Specifically, Transformer contains six encoder modules and six decoder modules. Each encoder module consists of a self-attention layer and a feed-forward layer; each decoder module consists of a self-attention layer, an encoder-decoder attention layer and a feed-forward layer.
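To make the attention mechanism inside each module concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation of every encoder and decoder module. It is illustrative only and is not taken from `src/transformer_model.py`; the head and sequence sizes merely mirror the "large" defaults listed under [Script Parameters](#script-parameters).

```python
# Illustrative only: NumPy sketch of scaled dot-product attention.
# The real implementation lives in src/transformer_model.py (MindSpore).
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v, mask=None):
    # q, k, v: [batch, heads, seq_len, head_dim]
    d_k = q.shape[-1]
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)  # block padding / future positions
    return softmax(scores) @ v

# Shapes mirroring the "large" defaults: hidden_size=1024, 16 heads, seq_length=128.
batch, heads, seq_len, head_dim = 1, 16, 128, 64
q = k = v = np.random.randn(batch, heads, seq_len, head_dim).astype(np.float32)
print(scaled_dot_product_attention(q, k, v).shape)  # (1, 16, 128, 64)
```

Multi-head attention runs this operation over 16 such heads in parallel and concatenates the results back to the 1024-dimensional hidden size.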
# [Dataset](#contents)

- *WMT English-German* for training.
- *WMT newstest2014* for evaluation.

# [Environment Requirements](#contents)

- Hardware (Ascend)
    - Prepare a hardware environment with an Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/en/master/index.html)
    - [MindSpore API](https://www.mindspore.cn/api/en/master/index.html)
# [Quick Start](#contents)

After dataset preparation, you can start training and evaluation as follows:

```bash
# run training example
sh scripts/run_standalone_train_ascend.sh 0 52 /path/ende-l128-mindrecord00
# run distributed training example
sh scripts/run_distribute_train_ascend.sh 8 52 /path/newstest2014-l128-mindrecord rank_table.json
# run evaluation example
python eval.py > eval.log 2>&1 &
```
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└─Transformer
  ├─README.md
  ├─scripts
    ├─process_output.sh
    ├─replace-quote.perl
    ├─run_distribute_train_ascend.sh
    └─run_standalone_train_ascend.sh
  ├─src
    ├─__init__.py
    ├─beam_search.py
    ├─config.py
    ├─dataset.py
    ├─eval_config.py
    ├─lr_schedule.py
    ├─process_output.py
    ├─tokenization.py
    ├─transformer_for_train.py
    ├─transformer_model.py
    └─weight_init.py
  ├─create_data.py
  ├─eval.py
  └─train.py
```
## [Script Parameters](#contents)

### Training Script Parameters

```
usage: train.py  [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                 [--enable_save_ckpt ENABLE_SAVE_CKPT]
                 [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                 [--enable_data_sink ENABLE_DATA_SINK] [--save_checkpoint_steps N]
                 [--save_checkpoint_num N] [--save_checkpoint_path SAVE_CHECKPOINT_PATH]
                 [--data_path DATA_PATH]

options:
    --distribute               run training on several devices: "true" (training with more than 1 device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 52
    --device_num               number of used devices: N, default is 1
    --device_id                device id: N, default is 0
    --enable_save_ckpt         enable saving checkpoints: "true" | "false", default is "true"
    --enable_lossscale         enable loss scaling: "true" | "false", default is "true"
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "false"
    --checkpoint_path          path to load checkpoint files: PATH, default is ""
    --save_checkpoint_steps    steps between saving checkpoint files: N, default is 2500
    --save_checkpoint_num      number of checkpoint files to keep: N, default is 30
    --save_checkpoint_path     path to save checkpoint files: PATH, default is "./checkpoint/"
    --data_path                path to the dataset file: PATH, default is ""
```
### Running Options

```
config.py:
    transformer_network     version of the Transformer model: base | large, default is large
    init_loss_scale_value   initial value of the loss scale: N, default is 2^10
    scale_factor            factor used to update the loss scale: N, default is 2
    scale_window            steps between two consecutive updates of the loss scale: N, default is 2000
    optimizer               optimizer used in the network: Adam, default is "Adam"

eval_config.py:
    transformer_network     version of the Transformer model: base | large, default is large
    data_file               data file: PATH
    model_file              checkpoint file to be loaded: PATH
    output_file             output file of evaluation: PATH
```
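The `init_loss_scale_value`, `scale_factor` and `scale_window` options configure dynamic loss scaling for float16 training. The sketch below is only a hypothetical illustration of how these three values typically interact; the actual behavior is provided by MindSpore's loss-scale update logic, not by this code.

```python
# Hypothetical sketch of dynamic loss scaling; illustration only.
class DynamicLossScale:
    def __init__(self, init_loss_scale_value=2 ** 10, scale_factor=2, scale_window=2000):
        self.loss_scale = float(init_loss_scale_value)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0

    def update(self, grads_overflowed):
        if grads_overflowed:
            # Overflow: shrink the scale and restart the counter.
            self.loss_scale = max(self.loss_scale / self.scale_factor, 1.0)
            self.good_steps = 0
        else:
            # After scale_window consecutive clean steps, grow the scale.
            self.good_steps += 1
            if self.good_steps >= self.scale_window:
                self.loss_scale *= self.scale_factor
                self.good_steps = 0

scaler = DynamicLossScale()
scaler.update(grads_overflowed=False)
print(scaler.loss_scale)  # 1024.0
```

With the defaults, the scale starts at 1024, shrinks whenever gradients overflow, and grows again after 2000 consecutive overflow-free steps.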
### Network Parameters

```
Parameters for dataset and network (Training/Evaluation):
    batch_size                      batch size of the input dataset: N, default is 96
    seq_length                      length of the input sequence: N, default is 128
    vocab_size                      size of the vocabulary: N, default is 36560
    hidden_size                     size of the Transformer encoder layers: N, default is 1024
    num_hidden_layers               number of hidden layers: N, default is 6
    num_attention_heads             number of attention heads: N, default is 16
    intermediate_size               size of the intermediate layer: N, default is 4096
    hidden_act                      activation function used: ACTIVATION, default is "relu"
    hidden_dropout_prob             dropout probability for TransformerOutput: Q, default is 0.3
    attention_probs_dropout_prob    dropout probability for TransformerAttention: Q, default is 0.3
    max_position_embeddings         maximum length of sequences: N, default is 128
    initializer_range               initialization value for TruncatedNormal: Q, default is 0.02
    label_smoothing                 label smoothing setting: Q, default is 0.1
    input_mask_from_dataset         whether to use the input mask loaded from the dataset: True | False, default is True
    beam_width                      beam width setting: N, default is 4
    max_decode_length               maximum decode length in evaluation: N, default is 80
    length_penalty_weight           weight used to normalize translation scores by their length: Q, default is 1.0
    compute_type                    compute type in Transformer: mstype.float16 | mstype.float32, default is mstype.float16

Parameters for learning rate:
    learning_rate                   value of the learning rate: Q
    warmup_steps                    number of learning rate warm-up steps: N
    start_decay_step                step at which the learning rate starts to decay: N
    min_lr                          minimal learning rate: Q
```
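The four learning-rate parameters describe a warm-up-then-decay schedule. The authoritative definition is in `src/lr_schedule.py`; the snippet below is only an assumed shape (linear warm-up, then square-root decay floored at `min_lr`) meant to show how the parameters relate, and the values in the example call are arbitrary.

```python
# Assumed shape of the schedule, for illustration only;
# see src/lr_schedule.py for the implementation actually used.
def get_lr(step, learning_rate, warmup_steps, start_decay_step, min_lr):
    if step < warmup_steps:
        # Linear warm-up from 0 to the base learning rate.
        return learning_rate * (step + 1) / warmup_steps
    if step < start_decay_step:
        return learning_rate
    # Square-root decay after start_decay_step, floored at min_lr.
    decayed = learning_rate * (start_decay_step / step) ** 0.5
    return max(decayed, min_lr)

for step in (0, 4000, 8000, 20000, 200000):
    print(step, get_lr(step, learning_rate=2.0, warmup_steps=8000,
                       start_decay_step=16000, min_lr=1e-5))
```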
## [Dataset Preparation](#contents)

- You may use this [shell script](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh) to download and preprocess the WMT English-German dataset. Assume you end up with the following files:
    - train.tok.clean.bpe.32000.en
    - train.tok.clean.bpe.32000.de
    - vocab.bpe.32000
    - newstest2014.tok.bpe.32000.en
    - newstest2014.tok.bpe.32000.de
    - newstest2014.tok.de
- Convert the original data to MindRecord for training:

```bash
paste train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.all
python create_data.py --input_file train.all --vocab_file vocab.bpe.32000 --output_file /path/ende-l128-mindrecord --max_seq_length 128
```

- Convert the original data to MindRecord for evaluation:

```bash
paste newstest2014.tok.bpe.32000.en newstest2014.tok.bpe.32000.de > test.all
python create_data.py --input_file test.all --vocab_file vocab.bpe.32000 --output_file /path/newstest2014-l128-mindrecord --num_splits 1 --max_seq_length 128 --clip_to_max_len True
```
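Conceptually, `create_data.py` consumes the tab-separated output of `paste`, tokenizes both sides with the BPE vocabulary, and writes fixed-length (128-token) examples to MindRecord. The hypothetical sketch below only illustrates the pairing and length handling; the actual tokenization and MindRecord writing live in `create_data.py` and `src/tokenization.py`.

```python
# Hypothetical sketch: how source/target pairs produced by `paste` might be
# read and length-limited. Illustration only, not the repository's code.
MAX_SEQ_LENGTH = 128

def read_pairs(path):
    """Yield (source_tokens, target_tokens) from a tab-separated file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            source, target = line.rstrip("\n").split("\t", 1)
            yield source.split(), target.split()

def limit_length(tokens, clip_to_max_len):
    """Clip over-long sequences when requested, otherwise drop them."""
    if len(tokens) <= MAX_SEQ_LENGTH:
        return tokens
    return tokens[:MAX_SEQ_LENGTH] if clip_to_max_len else None

for src, tgt in read_pairs("test.all"):
    src = limit_length(src, clip_to_max_len=True)  # evaluation data is clipped
    tgt = limit_length(tgt, clip_to_max_len=True)
    if src and tgt:
        pass  # hand the token lists to the BPE tokenizer / MindRecord writer
```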
## [Training Process](#contents)

- Set options in `config.py`, including the loss scale, learning rate and network hyperparameters. Click [here](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#mindspore) for more information about the dataset.
- Run `run_standalone_train_ascend.sh` for non-distributed training of the Transformer model.

```bash
sh scripts/run_standalone_train_ascend.sh DEVICE_ID EPOCH_SIZE DATA_PATH
```

- Run `run_distribute_train_ascend.sh` for distributed training of the Transformer model.

```bash
sh scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE
```

## [Evaluation Process](#contents)

- Set options in `eval_config.py`. Make sure 'data_file', 'model_file' and 'output_file' are set to your own paths.
- Run `eval.py` for evaluation of the Transformer model.

```bash
python eval.py
```
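`eval.py` decodes with beam search (`beam_width=4`) and normalizes candidate scores by length using `length_penalty_weight` (default 1.0). The snippet below shows a common (GNMT-style) form of this normalization as an assumption; the implementation actually used is in `src/beam_search.py`.

```python
# Assumed GNMT-style length penalty, for illustration only;
# see src/beam_search.py for the implementation used by eval.py.
def length_penalty(length, weight=1.0):
    return ((5.0 + length) / 6.0) ** weight

def normalized_score(sum_log_prob, length, weight=1.0):
    return sum_log_prob / length_penalty(length, weight)

# A longer hypothesis is no longer penalized merely for being long.
print(normalized_score(-12.0, 10))  # -4.8
print(normalized_score(-20.0, 25))  # -4.0
```

Without such normalization, beam search would systematically prefer shorter translations, since every additional token adds a negative log-probability to the score.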
- Run `process_output.sh` to convert the output token ids into the actual translation results.

```bash
sh scripts/process_output.sh REF_DATA EVAL_OUTPUT VOCAB_FILE
```

You will get two files, REF_DATA.forbleu and EVAL_OUTPUT.forbleu, for BLEU score calculation.

- To calculate the BLEU score, you may use this [perl script](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) and run the following command:

```bash
perl multi-bleu.perl REF_DATA.forbleu < EVAL_OUTPUT.forbleu
```
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | Transformer                                                                           |
| -------------------------- | ------------------------------------------------------------------------------------- |
| Resource                   | Ascend 910                                                                            |
| Uploaded Date              | 06/09/2020 (month/day/year)                                                           |
| MindSpore Version          | 0.5.0-beta                                                                            |
| Dataset                    | WMT English-German                                                                    |
| Training Parameters        | epoch=52, batch_size=96                                                               |
| Optimizer                  | Adam                                                                                  |
| Loss Function              | Softmax Cross Entropy                                                                 |
| BLEU Score                 | 28.7                                                                                  |
| Speed                      | 400 ms/step (8 pcs)                                                                   |
| Loss                       | 2.8                                                                                   |
| Params (M)                 | 213.7                                                                                 |
| Checkpoint for inference   | 2.4 GB (.ckpt file)                                                                   |
| Scripts                    | <https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/transformer> |

### Evaluation Performance

| Parameters          | Transformer                 |
| ------------------- | --------------------------- |
| Resource            | Ascend 910                  |
| Uploaded Date       | 06/09/2020 (month/day/year) |
| MindSpore Version   | 0.5.0-beta                  |
| Dataset             | WMT newstest2014            |
| batch_size          | 1                           |
| outputs             | BLEU score                  |
| Accuracy            | BLEU=28.7                   |
# [Description of Random Situation](#contents)

There are three sources of randomness:

- Shuffling of the dataset.
- Initialization of some model weights.
- Dropout operations.

Some seeds have already been set in `train.py` to avoid the randomness of dataset shuffling and weight initialization. If you want to disable dropout, please set the corresponding dropout_prob parameters to 0 in `src/config.py`.
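For reference, pinning the Python-level random sources looks like the snippet below; the concrete seed values and any framework-level seeds used by `train.py` may differ, and dropout is controlled through the dropout probabilities in `src/config.py` rather than through a seed.

```python
# Illustration only: Python/NumPy seeds as they might be fixed for
# reproducibility; train.py sets its own seeds and values.
import random
import numpy as np

random.seed(1)
np.random.seed(1)
```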
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).