  1. ![](https://www.mindspore.cn/static/img/logo.a3e472c9.png)
  2. <!-- TOC -->
  3. - [MASS: Masked Sequence to Sequence Pre-training for Language Generation Description](#mass-description)
  4. - [Model Architecture](#model-architecture)
  5. - [Dataset](#dataset)
  6. - [Features](#features)
  7. - [Script description](#script-description)
  8. - [Data Preparation](#Data-Preparation)
  9. - [Tokenization](#Tokenization)
  10. - [Byte Pair Encoding](#Byte-Pair-Encoding)
  11. - [Build Vocabulary](#Build-Vocabulary)
  12. - [Generate Dataset](#Generate-Dataset)
  13. - [News Crawl Corpus](#News-Crawl-Corpus)
  14. - [Gigaword Corpus](#Gigaword-Corpus)
  15. - [Cornell Movie Dialog Corpus](#Cornell-Movie-Dialog-Corpus)
  16. - [Configuration](#Configuration)
  17. - [Training & Evaluation process](#Training-&-Evaluation-process)
  18. - [Weights average](#Weights-average)
  19. - [Learning rate scheduler](#Learning-rate-scheduler)
  20. - [Environment Requirements](#environment-requirements)
  21. - [Platform](#Platform)
  22. - [Requirements](#Requirements)
  23. - [Get started](#get-started)
  24. - [Pre-training](#Pre-training)
  25. - [Fine-tuning](#Fine-tuning)
  26. - [Inference](#Inference)
  27. - [Performance](#performance)
  28. - [Results](#results)
  29. - [Training Performance](#training-performance)
  30. - [Inference Performance](#inference-performance)
  31. - [Description of random situation](#description-of-random-situation)
  32. - [others](#others)
  33. - [ModelZoo Homepage](#modelzoo-homepage)
  34. <!-- /TOC -->
  35. # MASS: Masked Sequence to Sequence Pre-training for Language Generation Description
  36. [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf) was released by Microsoft in June 2019.
  37. BERT (Devlin et al., 2018) achieved state-of-the-art results in natural language understanding by pre-training the encoder part of the Transformer (Vaswani et al., 2017) with masked rich-resource text. Likewise, GPT (Radford et al., 2018) pre-trains the decoder part of the Transformer with masked (encoder inputs are masked) rich-resource text. Both build a robust language model by pre-training with masked rich-resource text.
  38. Inspired by BERT, GPT and other language models, Microsoft proposed [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf), which combines the ideas of BERT and GPT. MASS has an important parameter k, which controls the masked fragment length. BERT and GPT are special cases in which k equals 1 and the sentence length, respectively.
  39. [Introducing MASS – A pre-training method that outperforms BERT and GPT in sequence to sequence language generation tasks](https://www.microsoft.com/en-us/research/blog/introducing-mass-a-pre-training-method-that-outperforms-bert-and-gpt-in-sequence-to-sequence-language-generation-tasks/)
  40. [Paper](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf): Song, Kaitao, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu. “MASS: Masked Sequence to Sequence Pre-training for Language Generation.” ICML (2019).
  41. # Model Architecture
  42. The MASS network is implemented with a Transformer that has multiple encoder layers and multiple decoder layers.
  43. For pre-training, we use the Adam optimizer and loss scaling to obtain the pre-trained model.
  44. During fine-tuning, we fine-tune this pre-trained model with different datasets according to the task.
  45. During testing, we use the fine-tuned model to predict the result, and adopt a beam search algorithm to
  46. obtain the most probable predictions.
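
The beam search step mentioned above can be illustrated with a minimal sketch. It is not the code in `src/transformer/beam_search.py`: the `beam_search` function, the toy probability table and the `<s>`/`</s>` symbols are assumptions made for illustration, and no length normalization is applied.

```python
import math

def beam_search(step_fn, bos, eos, beam_size=2, max_len=5):
    """Keep the `beam_size` best partial hypotheses at every step.

    step_fn(prefix) must return a dict {next_token: probability}.
    """
    beams = [([bos], 0.0)]  # (tokens, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, prob in step_fn(tokens).items():
                candidates.append((tokens + [tok], score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if needed
    return max(finished, key=lambda c: c[1])

# Toy next-token table (hypothetical): greedy decoding would pick "a" first,
# but the best full sequence starts with "b".
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "w": 0.1},
}

def toy_step(prefix):
    return TABLE.get(prefix[-1], {"</s>": 1.0})

best_tokens, best_score = beam_search(toy_step, "<s>", "</s>", beam_size=2)
```

With a beam of 2 the search keeps the lower-ranked first token alive long enough to discover that its continuation is more probable overall, which a greedy decoder would miss.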
  47. # Dataset
  48. Dataset used:
  49. - Monolingual English data from the News Crawl dataset (WMT 2019) for pre-training.
  50. - Gigaword Corpus (Graff et al., 2003) for text summarization.
  51. - Cornell Movie Dialog Corpus (Danescu-Niculescu-Mizil & Lee, 2011) for conversation response.
  52. Details about these datasets can be found in [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf).
  53. # Features
  54. MASS is designed to jointly pre-train the encoder and decoder for language generation tasks.
  55. First, within a sequence-to-sequence framework, MASS predicts only the masked tokens, which forces the encoder to understand the meaning of the unmasked tokens and encourages the decoder to extract useful information from the encoder.
  56. Second, by predicting consecutive tokens on the decoder side, the decoder can build better language modeling ability than by predicting only discrete tokens.
  57. Third, by further masking the decoder input tokens that are not masked in the encoder, the decoder is encouraged to extract more useful information from the encoder side, rather than relying on the rich information in the previous tokens.
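
The masking scheme described in this section can be sketched on a toy sentence as follows. This is an illustration under stated assumptions, not the repository's `src/language_model` code: the `mass_mask` helper, the `[MASK]` symbol and the way k is derived from a mask ratio are all hypothetical.

```python
import random

MASK = "[MASK]"  # placeholder symbol chosen for this sketch

def mass_mask(tokens, mask_ratio=0.5, start=None, rng=random):
    """Return (encoder_input, decoder_input, target) for one sentence."""
    n = len(tokens)
    k = max(1, round(n * mask_ratio))  # masked fragment length k
    if start is None:
        start = rng.randint(0, n - k)  # random fragment position
    end = start + k
    # The encoder sees the sentence with the fragment masked out.
    encoder_input = tokens[:start] + [MASK] * k + tokens[end:]
    # The decoder input is masked everywhere except inside the fragment,
    # where each position holds the previous fragment token (teacher forcing).
    decoder_input = [MASK] * n
    for i in range(start + 1, end):
        decoder_input[i] = tokens[i - 1]
    # Only the masked fragment is predicted.
    target = tokens[start:end]
    return encoder_input, decoder_input, target

sentence = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc, dec, tgt = mass_mask(sentence, mask_ratio=0.5, start=2)
```

Here the encoder input hides `x3` to `x6`, the decoder input exposes only `x3` to `x5` at the shifted positions, and the target is the hidden fragment.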
  58. # Script description
  59. The MASS script and code structure are as follows:
  60. ```text
  61. ├── mass
  62.     ├── README.md // Introduction of MASS model.
  63.     ├── config
  64.     │   ├──config.py // Configuration instance definition.
  65.     │   ├──config.json // Configuration file.
  66.     ├── src
  67.     │   ├──dataset
  68.     │   │   ├──bi_data_loader.py // Dataset loader for fine-tune or inferring.
  69.     │   │   ├──mono_data_loader.py // Dataset loader for pre-training.
  70.     │   ├──language_model
  71.     │   │   ├──noise_channel_language_model.p // Noisy channel language model for dataset generation.
  72.     │   │   ├──mass_language_model.py // MASS language model according to MASS paper.
  73.     │   │   ├──loose_masked_language_model.py // MASS language model according to MASS released code.
  74.     │   │   ├──masked_language_model.py // Masked language model according to MASS paper.
  75.     │   ├──transformer
  76.     │   │   ├──create_attn_mask.py // Generate mask matrix to remove padding positions.
  77.     │   │   ├──transformer.py // Transformer model architecture.
  78.     │   │   ├──encoder.py // Transformer encoder component.
  79.     │   │   ├──decoder.py // Transformer decoder component.
  80.     │   │   ├──self_attention.py // Self-Attention block component.
  81.     │   │   ├──multi_head_attention.py // Multi-Head Self-Attention component.
  82.     │   │   ├──embedding.py // Embedding component.
  83.     │   │   ├──positional_embedding.py // Positional embedding component.
  84.     │   │   ├──feed_forward_network.py // Feed forward network.
  85.     │   │   ├──residual_conn.py // Residual block.
  86.     │   │   ├──beam_search.py // Beam search decoder for inferring.
  87.     │   │   ├──transformer_for_infer.py // Use Transformer to infer.
  88.     │   │   ├──transformer_for_train.py // Use Transformer to train.
  89.     │   ├──utils
  90.     │   │   ├──byte_pair_encoding.py // Apply BPE with subword-nmt.
  91.     │   │   ├──dictionary.py // Dictionary.
  92.     │   │   ├──loss_moniter.py // Callback for monitoring loss during training steps.
  93.     │   │   ├──lr_scheduler.py // Learning rate scheduler.
  94.     │   │   ├──ppl_score.py // Perplexity score based on N-gram.
  95.     │   │   ├──rouge_score.py // Calculate ROUGE score.
  96.     │   │   ├──load_weights.py // Load weights from a checkpoint or NPZ file.
  97.     │   │   ├──initializer.py // Parameters initializer.
  98.     ├── vocab
  99.     │   ├──all.bpe.codes // BPE codes table(this file should be generated by user).
  100.     │   ├──all_en.dict.bin // Learned vocabulary file(this file should be generated by user).
  101.     ├── scripts
  102.     │   ├──run_ascend.sh // Ascend train & evaluate model script.
  103.     │   ├──run_gpu.sh // GPU train & evaluate model script.
  104.     │   ├──learn_subword.sh // Learn BPE codes.
  105.     │   ├──stop_training.sh // Stop training.
  106.     ├── requirements.txt // Requirements of third party package.
  107.     ├── train.py // Train API entry.
  108.     ├── eval.py // Infer API entry.
  109.     ├── tokenize_corpus.py // Corpus tokenization.
  110.     ├── apply_bpe_encoding.py // Applying bpe encoding.
  111.     ├── weights_average.py // Average multi model checkpoints to NPZ format.
  112.     ├── news_crawl.py // Create News Crawl dataset for pre-training.
  113.     ├── gigaword.py // Create Gigaword Corpus.
  114.     ├── cornell_dialog.py // Create Cornell Movie Dialog dataset for conversation response.
  115. ```
  116. ## Data Preparation
  117. Data preparation for a natural language processing task consists of data cleaning, tokenization, encoding and vocabulary generation.
  118. In our experiments, [Byte Pair Encoding(BPE)](https://arxiv.org/abs/1508.07909) reduces the vocabulary size and effectively mitigates the out-of-vocabulary (OOV) problem.
  119. The vocabulary can be created using `src/utils/dictionary.py` from the text dictionary learned by BPE.
  120. For more details about BPE, please refer to the [Subword-nmt lib](https://www.cnpython.com/pypi/subword-nmt) or the [paper](https://arxiv.org/abs/1508.07909).
  121. In our experiments, the vocabulary was learned from 1.9M sentences of the News Crawl dataset; the vocabulary size is 45755.
  122. The data preparation scripts are briefly introduced below.
  123. ### Tokenization
  124. `tokenize_corpus.py` tokenizes a corpus whose text files are in `.txt` format.
  125. Major parameters in `tokenize_corpus.py`:
  126. ```bash
  127. --corpus_folder: Corpus folder path. If multiple folders are provided, separate them with ','.
  128. --output_folder: Output folder path.
  129. --tokenizer: Tokenizer to be used, nltk or jieba. If nltk is not fully installed, use jieba instead.
  130. --pool_size: Processes pool size.
  131. ```
  132. Sample code:
  133. ```bash
  134. python tokenize_corpus.py --corpus_folder /{path}/corpus --output_folder /{path}/tokenized_corpus --tokenizer {nltk|jieba} --pool_size 16
  135. ```
  136. ### Byte Pair Encoding
  137. After tokenization, BPE is applied to the tokenized corpus with the provided `all.bpe.codes`.
  138. The script for applying BPE is `apply_bpe_encoding.py`.
  139. Major parameters in `apply_bpe_encoding.py`:
  140. ```bash
  141. --codes: BPE codes file.
  142. --src_folder: Corpus folders.
  143. --output_folder: Output files folder.
  144. --prefix: Prefix of text file in `src_folder`.
  145. --vocab_path: Generated vocabulary output path.
  146. --threshold: Filter out words whose frequency is lower than the threshold.
  147. --processes: Size of process pool (to accelerate). Default: 2.
  148. ```
  149. Sample code:
  150. ```bash
  151. python apply_bpe_encoding.py --codes /{path}/all.bpe.codes \
  152. --src_folder /{path}/tokenized_corpus \
  153. --output_folder /{path}/tokenized_corpus/bpe \
  154. --prefix tokenized \
  155. --vocab_path /{path}/vocab_en.dict.bin \
  156. --processes 32
  157. ```
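
To make the idea behind the BPE codes file more concrete, the following sketch learns a few merge operations on a toy word-frequency table. It follows the reference algorithm from the BPE paper linked above and is not the implementation used by `subword-nmt` or by this repository; the `learn_bpe` wrapper and the toy vocabulary are assumptions for illustration.

```python
import collections
import re

def get_stats(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = collections.Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[symbols[i], symbols[i + 1]] += freq
    return pairs

def merge_vocab(pair, vocab):
    """Replace every occurrence of the symbol pair with its concatenation."""
    pattern = re.compile(r"(?<!\S)" + re.escape(" ".join(pair)) + r"(?!\S)")
    merged = "".join(pair)
    return {pattern.sub(lambda _m: merged, word): freq
            for word, freq in vocab.items()}

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair `num_merges` times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_vocab(best, vocab)
        merges.append(best)
    return merges, vocab

# Words are written as space-separated symbols with an end-of-word marker.
toy_vocab = {
    "l o w </w>": 5,
    "l o w e r </w>": 2,
    "n e w e s t </w>": 6,
    "w i d e s t </w>": 3,
}
merges, merged_vocab = learn_bpe(toy_vocab, num_merges=3)
```

Each learned merge corresponds to one line of a codes table; frequent suffixes such as `est</w>` become single symbols, which is how BPE keeps the vocabulary small while still covering rare words.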
  158. ### Build Vocabulary
  159. Suppose that you want to create a new vocabulary. There are two options (steps 1 and 2), followed by an optional shrinking step and a persistence step:
  160. 1. Learn BPE codes from scratch, and create the vocabulary from the vocabulary files produced by `subword-nmt`.
  161. 2. Create it from an existing vocabulary file whose lines are in the format `word frequency`.
  162. 3. *Optional*: create a smaller vocabulary based on `vocab/all_en.dict.bin` with the `shrink` method from `src/utils/dictionary.py`.
  163. 4. Persist the vocabulary to the `vocab` folder with the `persistence()` method.
  164. The major interfaces of `src/utils/dictionary.py` are as follows:
  165. 1. `shrink(self, threshold=50)`: Shrink the vocabulary by filtering out words whose frequency is lower than the threshold. It returns a new vocabulary.
  166. 2. `load_from_text(cls, filepaths: List[str])`: Load an existing text vocabulary whose lines are in the format `word frequency`.
  167. 3. `load_from_persisted_dict(cls, filepath)`: Load from a persisted binary vocabulary which was saved by calling `persistence()` method.
  168. 4. `persistence(self, path)`: Save vocabulary object to binary file.
  169. Sample code:
  170. ```python
  171. from src.utils import Dictionary
  172. vocabulary = Dictionary.load_from_persisted_dict("vocab/all_en.dict.bin")
  173. tokens = [1, 2, 3, 4, 5]
  174. # Convert ids to symbols.
  175. print([vocabulary[t] for t in tokens])
  176. sentence = ["Hello", "world"]
  177. # Convert symbols to ids.
  178. print([vocabulary.index[s] for s in sentence])
  179. ```
  180. For more detail, please refer to the source file.
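
As a rough illustration of what loading a `word frequency` file and shrinking by a threshold involve, here is a small stand-in class. `TinyVocab` and its methods are hypothetical and written only for this sketch; they are not the `Dictionary` class from `src/utils/dictionary.py`, whose behaviour may differ.

```python
from collections import Counter

class TinyVocab:
    """Hypothetical stand-in for a frequency-based vocabulary."""

    def __init__(self, counts):
        self.counts = Counter(counts)
        # Most frequent words get the smallest ids; ties are broken alphabetically.
        self.words = sorted(self.counts, key=lambda w: (-self.counts[w], w))
        self.ids = {word: idx for idx, word in enumerate(self.words)}

    @classmethod
    def from_lines(cls, lines):
        """Build a vocabulary from lines in the format `word frequency`."""
        counts = Counter()
        for line in lines:
            parts = line.split()
            if len(parts) == 2:
                counts[parts[0]] += int(parts[1])
        return cls(counts)

    def shrink(self, threshold=50):
        """Return a new vocabulary without words rarer than `threshold`."""
        kept = {w: c for w, c in self.counts.items() if c >= threshold}
        return TinyVocab(kept)

full = TinyVocab.from_lines(["the 120", "cat 60", "zyzzyva 3"])
small = full.shrink(threshold=50)
```

Returning a new object from `shrink` mirrors the interface description above and leaves the full vocabulary available for other datasets.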
  181. ### Generate Dataset
  182. As mentioned above, three corpora are used with the MASS model; dataset generation scripts are provided for each of them.
  183. #### News Crawl Corpus
  184. Script can be found in `news_crawl.py`.
  185. Major parameters in `news_crawl.py`:
  186. ```bash
  187. Note: provide at least one of `--existed_vocab` or `--dict_folder`.
  188. A new vocabulary is created in `output_folder` when `--dict_folder` is passed.
  189. --src_folder: Corpus folders.
  190. --existed_vocab: Optional, persisted vocabulary file.
  191. --mask_ratio: Ratio of mask.
  192. --output_folder: Output dataset files folder path.
  193. --max_len: Maximum sentence length. Sentences longer than `max_len` are dropped.
  194. --suffix: Optional, suffix of generated dataset files.
  195. --processes: Optional, size of process pool (to accelerate). Default: 2.
  196. ```
  197. Sample code:
  198. ```bash
  199. python news_crawl.py --src_folder /{path}/news_crawl \
  200. --existed_vocab /{path}/mass/vocab/all_en.dict.bin \
  201. --mask_ratio 0.5 \
  202. --output_folder /{path}/news_crawl_dataset \
  203. --max_len 32 \
  204. --processes 32
  205. ```
  206. #### Gigaword Corpus
  207. Script can be found in `gigaword.py`.
  208. Major parameters in `gigaword.py`:
  209. ```bash
  210. --train_src: Train source file path.
  211. --train_ref: Train reference file path.
  212. --test_src: Test source file path.
  213. --test_ref: Test reference file path.
  214. --existed_vocab: Persisted vocabulary file.
  215. --output_folder: Output dataset files folder path.
  216. --noise_prob: Optional, add noise prob. Default: 0.
  217. --max_len: Optional, maximum sentence length. Sentences longer than `max_len` are dropped. Default: 64.
  218. --format: Optional, dataset format, "mindrecord" or "tfrecord". Default: "tfrecord".
  219. ```
  220. Sample code:
  221. ```bash
  222. python gigaword.py --train_src /{path}/gigaword/train_src.txt \
  223. --train_ref /{path}/gigaword/train_ref.txt \
  224. --test_src /{path}/gigaword/test_src.txt \
  225. --test_ref /{path}/gigaword/test_ref.txt \
  226. --existed_vocab /{path}/mass/vocab/all_en.dict.bin \
  227. --noise_prob 0.1 \
  228. --output_folder /{path}/gigaword_dataset \
  229. --max_len 64
  230. ```
  231. #### Cornell Movie Dialog Corpus
  232. Script can be found in `cornell_dialog.py`.
  233. Major parameters in `cornell_dialog.py`:
  234. ```bash
  235. --src_folder: Corpus folders.
  236. --existed_vocab: Persisted vocabulary file.
  237. --train_prefix: Train source and target file prefix. Default: train.
  238. --test_prefix: Test source and target file prefix. Default: test.
  239. --output_folder: Output dataset files folder path.
  240. --max_len: Maximum sentence length. Sentences longer than `max_len` are dropped.
  241. --valid_prefix: Optional, Valid source and target file prefix. Default: valid.
  242. ```
  243. Sample code:
  244. ```bash
  245. python cornell_dialog.py --src_folder /{path}/cornell_dialog \
  246. --existed_vocab /{path}/mass/vocab/all_en.dict.bin \
  247. --train_prefix train \
  248. --test_prefix test \
  249. --noise_prob 0.1 \
  250. --output_folder /{path}/cornell_dialog_dataset \
  251. --max_len 64
  252. ```
  253. ## Configuration
  254. The JSON file under the path `config/` is the template configuration file.
  255. Almost all required options and arguments can be set there, including the training platform, the dataset and model configurations, and the optimizer arguments. Optional features such as loss scale and checkpointing are also available by setting the corresponding options.
  256. For more detailed information about the attributes, refer to the file `config/config.py`.
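
A configuration file of this kind can be inspected or adjusted programmatically before a run. The sketch below is an assumption-laden illustration: the `load_config` helper and the `/tmp/model.npz` path are hypothetical, and the template only contains the few keys that appear in the JSON snippets of this README, not the full `config/config.json`.

```python
import json

# A reduced template with keys taken from the snippets in this README.
TEMPLATE = """
{
  "checkpoint_options": {"existed_ckpt": "", "save_ckpt_steps": 1000},
  "learn_rate_config": {"optimizer": "adam", "lr": 1e-4, "lr_scheduler": "poly"}
}
"""

def load_config(text, overrides=None):
    """Parse a JSON config and apply overrides given as 'section.key' paths."""
    config = json.loads(text)
    for dotted_key, value in (overrides or {}).items():
        *parents, leaf = dotted_key.split(".")
        node = config
        for name in parents:
            node = node[name]
        if leaf not in node:
            # Reject unknown keys so that typos do not pass silently.
            raise KeyError(dotted_key)
        node[leaf] = value
    return config

config = load_config(
    TEMPLATE,
    overrides={"checkpoint_options.existed_ckpt": "/tmp/model.npz"},
)
```

Rejecting unknown keys is a design choice of this sketch; it makes a misspelled option fail early instead of being ignored.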
  257. ## Training & Evaluation process
  258. To train a model, the shell script `run_ascend.sh` or `run_gpu.sh` is all you need. These scripts set the environment variables and execute the training script `train.py` under `mass`.
  259. You can start a training task on a single device or on multiple devices by setting the options and running the command in bash:
  260. Ascend:
  261. ```bash
  262. sh run_ascend.sh [--options]
  263. ```
  264. GPU:
  265. ```bash
  266. sh run_gpu.sh [--options]
  267. ```
  268. The usage of `run_ascend.sh` is shown below:
  269. ```text
  270. Usage: run_ascend.sh [-h, --help] [-t, --task <CHAR>] [-n, --device_num <N>]
  271. [-i, --device_id <N>] [-j, --hccl_json <FILE>]
  272. [-c, --config <FILE>] [-o, --output <FILE>]
  273. [-v, --vocab <FILE>]
  274. options:
  275. -h, --help show usage
  276. -t, --task select task: CHAR, 't' for train and 'i' for inference".
  277. -n, --device_num device number used for training: N, default is 1.
  278. -i, --device_id device id used for training with single device: N, 0<=N<=7, default is 0.
  279. -j, --hccl_json rank table file used for training with multiple devices: FILE.
  280. -c, --config configuration file as shown in the path 'mass/config': FILE.
  281. -o, --output assign output file of inference: FILE.
  282. -v, --vocab set the vocabulary.
  283. -m, --metric set the metric.
  284. ```
  285. Note: be sure to specify the hccl_json file when running distributed training.
  286. The usage of `run_gpu.sh` is shown below:
  287. ```text
  288. Usage: run_gpu.sh [-h, --help] [-t, --task <CHAR>] [-n, --device_num <N>]
  289. [-i, --device_id <N>] [-c, --config <FILE>]
  290. [-o, --output <FILE>] [-v, --vocab <FILE>]
  291. options:
  292. -h, --help show usage
  293. -t, --task select task: CHAR, 't' for train and 'i' for inference".
  294. -n, --device_num device number used for training: N, default is 1.
  295. -i, --device_id device id used for training with single device: N, 0<=N<=7, default is 0.
  296. -c, --config configuration file as shown in the path 'mass/config': FILE.
  297. -o, --output assign output file of inference: FILE.
  298. -v, --vocab set the vocabulary.
  299. -m, --metric set the metric.
  300. ```
  301. The following commands show an example of training with 2 devices.
  302. Ascend:
  303. ```bash
  304. sh run_ascend.sh --task t --device_num 2 --hccl_json /{path}/rank_table.json --config /{path}/config.json
  305. ```
  306. Note: discontinuous device ids are not supported in `run_ascend.sh` at present; device ids in `rank_table.json` must start from 0.
  307. GPU:
  308. ```bash
  309. sh run_gpu.sh --task t --device_num 2 --config /{path}/config.json
  310. ```
  311. When using a single chip, the commands look like this:
  312. Ascend:
  313. ```bash
  314. sh run_ascend.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
  315. ```
  316. GPU:
  317. ```bash
  318. sh run_gpu.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
  319. ```
  320. ## Weights average
  321. ```bash
  322. python weights_average.py --input_files your_checkpoint_list --output_file model.npz
  323. ```
  324. `input_files` is a list of your checkpoint files. To use `model.npz` as the weights, add its path to `config.json` at "existed_ckpt".
  325. ```json
  326. {
  327. ...
  328. "checkpoint_options": {
  329. "existed_ckpt": "/xxx/xxx/model.npz",
  330. "save_ckpt_steps": 1000,
  331. ...
  332. },
  333. ...
  334. }
  335. ```
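
Conceptually, weight averaging takes the element-wise mean of each parameter across several checkpoints. The sketch below assumes a plain arithmetic mean over in-memory NumPy arrays; the `average_weights` helper and the parameter names `w` and `b` are hypothetical, and `weights_average.py` may read checkpoints and weight them differently.

```python
import numpy as np

def average_weights(checkpoints):
    """Average parameters element-wise.

    checkpoints: list of dicts mapping parameter name -> np.ndarray,
    all with identical names and shapes.
    """
    names = checkpoints[0].keys()
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }

# Two toy "checkpoints" with hypothetical parameter names.
ckpt_a = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
ckpt_b = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}
averaged = average_weights([ckpt_a, ckpt_b])

# The result could then be stored in NPZ format, for example:
# np.savez("model.npz", **averaged)
```

Averaging the last few checkpoints of a run is a common way to smooth out step-to-step noise in the weights.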
  336. ## Learning rate scheduler
  337. Two learning rate schedulers are provided in our model:
  338. 1. [Polynomial decay scheduler](https://towardsdatascience.com/learning-rate-schedules-and-adaptive-learning-rate-methods-for-deep-learning-2c8f433990d1).
  339. 2. [Inverse square root scheduler](https://ece.uwaterloo.ca/~dwharder/aads/Algorithms/Inverse_square_root/).
  340. The LR scheduler can be configured in `config/config.json`.
  341. For the polynomial decay scheduler, the config could look like:
  342. ```json
  343. {
  344. ...
  345. "learn_rate_config": {
  346. "optimizer": "adam",
  347. "lr": 1e-4,
  348. "lr_scheduler": "poly",
  349. "poly_lr_scheduler_power": 0.5,
  350. "decay_steps": 10000,
  351. "warmup_steps": 2000,
  352. "min_lr": 1e-6
  353. },
  354. ...
  355. }
  356. ```
  357. For the inverse square root scheduler, the config could look like:
  358. ```json
  359. {
  360. ...
  361. "learn_rate_config": {
  362. "optimizer": "adam",
  363. "lr": 1e-4,
  364. "lr_scheduler": "isr",
  365. "decay_start_step": 12000,
  366. "warmup_steps": 2000,
  367. "min_lr": 1e-6
  368. },
  369. ...
  370. }
  371. ```
  372. More details about the LR schedulers can be found in `src/utils/lr_scheduler.py`.
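
To give a feel for how the two schedules behave with the example values above, here is a sketch using common textbook formulations. The functions `poly_lr` and `isr_lr`, the linear warmup, and the choice to count `decay_steps` from the end of warmup are assumptions; the exact formulas in `src/utils/lr_scheduler.py` may differ.

```python
def poly_lr(step, lr=1e-4, min_lr=1e-6, warmup_steps=2000,
            decay_steps=10000, power=0.5):
    """Linear warmup, then polynomial decay from `lr` down to `min_lr`."""
    if step < warmup_steps:
        return lr * step / warmup_steps
    progress = min(step - warmup_steps, decay_steps) / decay_steps
    return (lr - min_lr) * (1.0 - progress) ** power + min_lr

def isr_lr(step, lr=1e-4, min_lr=1e-6, warmup_steps=2000,
           decay_start_step=12000):
    """Linear warmup, constant phase, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return lr * step / warmup_steps
    if step < decay_start_step:
        return lr
    return max(min_lr, lr * (decay_start_step / step) ** 0.5)

# Sample both schedules at a few steps.
samples = [(s, poly_lr(s), isr_lr(s)) for s in (0, 1000, 2000, 12000, 48000)]
```

Under these assumptions the polynomial schedule reaches `min_lr` after the decay window and stays there, while the inverse square root schedule keeps shrinking slowly and is only clipped at `min_lr`.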
  373. # Environment Requirements
  374. ## Platform
  375. - Hardware (Ascend/GPU)
  376.     - Prepare a hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
  377. - Framework
  378.     - [MindSpore](http://10.90.67.50/mindspore/archive/20200506/OpenSource/me_vm_x86/)
  379. - For more information, please check the resources below:
  380.     - [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html)
  381.     - [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)
  382. ## Requirements
  383. ```txt
  384. nltk
  385. numpy
  386. subword-nmt
  387. rouge
  388. ```
  389. https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/network_migration.html
  390. # Get started
  391. MASS pre-trains a sequence-to-sequence model by predicting the masked fragments in an input sequence. After that, downstream tasks including text summarization and conversation response are candidates for fine-tuning the model and for inference.
  392. Here we provide a practical example to demonstrate the basic usage of MASS for pre-training, fine-tuning a model, and the inference process. The overall process is as follows:
  393. 1. Download and process the dataset.
  394. 2. Modify `config.json` to configure the network.
  395. 3. Run a task for pre-training and fine-tuning.
  396. 4. Perform inference and validation.
  397. ## Pre-training
  398. To pre-train a model, first configure the options in `config.json`:
  399. - Assign the `pre_train_dataset` under `dataset_config` node to the dataset path.
  400. - Choose the optimizer (`momentum`, `adam` or `lamb` is available).
  401. - Assign the `ckpt_prefix` and `ckpt_path` under `checkpoint_path` to save the model files.
  402. - Set other arguments including dataset configurations and network configurations.
  403. - If you already have a trained model, assign `existed_ckpt` to the checkpoint file.
  404. If you use an Ascend chip, run the shell script `run_ascend.sh` as follows:
  405. ```bash
  406. sh run_ascend.sh -t t -n 1 -i 1 -c /mass/config/config.json
  407. ```
  408. You can also run the shell script `run_gpu.sh` on GPU as follows:
  409. ```bash
  410. sh run_gpu.sh -t t -n 1 -i 1 -c /mass/config/config.json
  411. ```
  412. The log and output files are written under the path `./train_mass_*/`, and the model file under the path assigned in the `config/config.json` file.
  413. ## Fine-tuning
  414. To fine-tune a model, first configure the options in `config.json`:
  415. - Assign the `fine_tune_dataset` under `dataset_config` node to the dataset path.
  416. - Assign the `existed_ckpt` under `checkpoint_path` node to the existing model file generated by pre-training.
  417. - Choose the optimizer (`momentum`, `adam` or `lamb` is available).
  418. - Assign the `ckpt_prefix` and `ckpt_path` under `checkpoint_path` node to save the model files.
  419. - Set other arguments including dataset configurations and network configurations.
  420. If you use an Ascend chip, run the shell script `run_ascend.sh` as follows:
  421. ```bash
  422. sh run_ascend.sh -t t -n 1 -i 1 -c config/config.json
  423. ```
  424. You can also run the shell script `run_gpu.sh` on GPU as follows:
  425. ```bash
  426. sh run_gpu.sh -t t -n 1 -i 1 -c config/config.json
  427. ```
  428. The log and output files are written under the path `./train_mass_*/`, and the model file under the path assigned in the `config/config.json` file.
  429. ## Inference
  430. If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, refer to this [Link](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/network_migration.html).
  431. For inference, first configure the options in `config.json`:
  432. - Assign the `test_dataset` under `dataset_config` node to the dataset path.
  433. - Assign the `existed_ckpt` under `checkpoint_path` node to the model file produced by fine-tuning.
  434. - Choose the optimizer (`momentum`, `adam` or `lamb` is available).
  435. - Assign the `ckpt_prefix` and `ckpt_path` under `checkpoint_path` node to save the model files.
  436. - Set other arguments including dataset configurations and network configurations.
  437. If you use an Ascend chip, run the shell script `run_ascend.sh` as follows:
  438. ```bash
  439. sh run_ascend.sh -t i -n 1 -i 1 -c config/config.json -o {outputfile}
  440. ```
  441. You can also run the shell script `run_gpu.sh` on GPU as follows:
  442. ```bash
  443. sh run_gpu.sh -t i -n 1 -i 1 -c config/config.json -o {outputfile}
  444. ```
  445. # Performance
  446. ## Results
  447. ### Fine-Tuning on Text Summarization
  448. The comparison between MASS and two other pre-training methods in terms of ROUGE score on the text summarization task
  449. with 3.8M training data is as follows:
  450. | Method | RG-1(F) | RG-2(F) | RG-L(F) |
  451. |:---------------|:--------------|:-------------|:-------------|
  452. | MASS | Ongoing | Ongoing | Ongoing |
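
For readers unfamiliar with the RG-N(F) columns, the sketch below computes a ROUGE-N F-score from clipped n-gram overlap between a candidate and a single reference. It is a simplified illustration: the `rouge_n_f` function is hypothetical, uses whitespace tokenization, and does not reproduce the `rouge` package or `src/utils/rouge_score.py`.

```python
from collections import Counter

def rouge_n_f(candidate, reference, n=1):
    """F1 of clipped n-gram overlap between candidate and reference."""
    def ngrams(tokens):
        return Counter(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )

    cand = ngrams(candidate.split())
    ref = ngrams(reference.split())
    overlap = sum((cand & ref).values())  # each n-gram counted at most min times
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

summary = "the cat sat on the mat"
reference = "the cat is on the mat"
rg1 = rouge_n_f(summary, reference, n=1)
rg2 = rouge_n_f(summary, reference, n=2)
```

In this toy pair five of six unigrams and three of five bigrams match, so the bigram score is noticeably lower than the unigram score.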
  453. ### Fine-Tuning on Conversational Response Generation
  454. The comparison between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus is as follows:
  455. | Method | Data = 10K | Data = 110K |
  456. |--------------------|------------------|-----------------|
  457. | MASS | Ongoing | Ongoing |
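
PPL (perplexity) is commonly defined as the exponential of the negative mean log-probability that a model assigns to the reference tokens. The sketch below shows that standard definition only; the `perplexity` helper is hypothetical and is not the N-gram based scorer in `src/utils/ppl_score.py`.

```python
import math

def perplexity(token_log_probs):
    """exp(-mean(log p)) over the natural-log probabilities of the tokens."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# A model that assigns probability 0.25 to every token of a 4-token reply.
uniform_reply = [math.log(0.25)] * 4
ppl_uniform = perplexity(uniform_reply)

# A more confident model gives a lower perplexity.
confident_reply = [math.log(0.5)] * 4
ppl_confident = perplexity(confident_reply)
```

A perplexity of N can be read as the model being, on average, as uncertain as if it were choosing uniformly among N tokens, so lower values are better.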
  458. ### Training Performance
  459. | Parameters | Masked Sequence to Sequence Pre-training for Language Generation |
  460. |:---------------------------|:--------------------------------------------------------------------------|
  461. | Model Version | v1 |
  462. | Resource | Ascend 910, CPU 2.60 GHz, 56 cores; memory 314 GB |
  463. | uploaded Date | 05/24/2020 |
  464. | MindSpore Version | 0.2.0 |
  465. | Dataset | News Crawl 2007-2017 English monolingual corpus, Gigaword corpus, Cornell Movie Dialog corpus |
  466. | Training Parameters | Epoch=50, steps=XXX, batch_size=192, lr=1e-4 |
  467. | Optimizer | Adam |
  468. | Loss Function | Label smoothed cross-entropy criterion |
  469. | outputs | Sentence and probability |
  470. | Loss | Lower than 2 |
  471. | Accuracy | For conversation response, ppl=23.52, for text summarization, RG-1=29.79. |
  472. | Speed | 611.45 sentences/s |
  473. | Total time | --/-- |
  474. | Params (M) | 44.6M |
  475. | Checkpoint for Fine tuning | ---Mb, --, [A link]() |
  476. | Model for inference | ---Mb, --, [A link]() |
  477. | Scripts | [A link]() |
  478. ### Inference Performance
  479. | Parameters | Masked Sequence to Sequence Pre-training for Language Generation |
  480. |:---------------------------|:-----------------------------------------------------------|
  481. | Model Version | V1 |
  482. | Resource | Huawei 910 |
  483. | uploaded Date | 05/24/2020 |
  484. | MindSpore Version | 0.2.0 |
  485. | Dataset | Gigaword corpus, Cornell Movie Dialog corpus |
  486. | batch_size | --- |
  487. | outputs | Sentence and probability |
  488. | Accuracy | ppl=23.52 for conversation response, RG-1=29.79 for text summarization. |
  489. | Speed | ---- sentences/s |
  490. | Total time | --/-- |
  491. | Model for inference | ---Mb, --, [A link]() |
  492. # Description of random situation
  493. The MASS model contains dropout operations. To disable dropout, set the related dropout_rate to 0 in `config/config.json`.
  494. # others
  495. The model has been validated in the Ascend environment; it has not been validated on CPU or GPU.
  496. # ModelZoo Homepage
  497. [Link](https://gitee.com/mindspore/mindspore/tree/master/mindspore/model_zoo)