# Transformer Example
## Description
This example implements training and evaluation of the Transformer model introduced in the following paper:
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS 2017, pages 5998–6008.

## Requirements
- Install [MindSpore](https://www.mindspore.cn/install/en).
- Download and preprocess the WMT English-German dataset for training and evaluation.

> Notes: If you are running an evaluation task, prepare the corresponding checkpoint file.

## Example structure

```shell
.
└─Transformer
  ├─README.md
  ├─scripts
    ├─process_output.sh
    ├─replace-quote.perl
    ├─run_distribute_train.sh
    └─run_standalone_train.sh
  ├─src
    ├─__init__.py
    ├─beam_search.py
    ├─config.py
    ├─dataset.py
    ├─eval_config.py
    ├─lr_schedule.py
    ├─process_output.py
    ├─tokenization.py
    ├─transformer_for_train.py
    ├─transformer_model.py
    └─weight_init.py
  ├─create_data.py
  ├─eval.py
  └─train.py
```

---

## Prepare the dataset
- You may use this [shell script](https://github.com/tensorflow/nmt/blob/master/nmt/scripts/wmt16_en_de.sh) to download and preprocess the WMT English-German dataset. After preprocessing, you should have the following files:
  - train.tok.clean.bpe.32000.en
  - train.tok.clean.bpe.32000.de
  - vocab.bpe.32000
  - newstest2014.tok.bpe.32000.en
  - newstest2014.tok.bpe.32000.de
  - newstest2014.tok.de

- Convert the original data to MindRecord files for training (a sketch for verifying the resulting files follows this list):

```bash
paste train.tok.clean.bpe.32000.en train.tok.clean.bpe.32000.de > train.all
python create_data.py --input_file train.all --vocab_file vocab.bpe.32000 --output_file /path/ende-l128-mindrecord --max_seq_length 128
```
- Convert the original data to MindRecord files for evaluation:

```bash
paste newstest2014.tok.bpe.32000.en newstest2014.tok.bpe.32000.de > test.all
python create_data.py --input_file test.all --vocab_file vocab.bpe.32000 --output_file /path/newstest2014-l128-mindrecord --num_splits 1 --max_seq_length 128 --clip_to_max_len True
```
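
Before launching training, it can be useful to confirm that the generated MindRecord files are readable. The following is a minimal sketch, assuming MindSpore is installed; the path matches the conversion command above, and the column names are read from the data rather than assumed:

```python
"""Minimal sanity check for a generated MindRecord file (illustrative sketch)."""
import mindspore.dataset as ds

# Path produced by create_data.py above; adjust to your own location.
MINDRECORD_PATH = "/path/ende-l128-mindrecord"

# MindDataset reads MindRecord files; the first positional argument is the file path.
dataset = ds.MindDataset(MINDRECORD_PATH)

print("number of samples:", dataset.get_dataset_size())

# Inspect the first record to see which columns create_data.py wrote.
for record in dataset.create_dict_iterator():
    for name, value in record.items():
        print(name, value.shape)
    break
```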

## Running the example

### Training
- Set options in `config.py`, including the loss scale, learning rate, and network hyperparameters; the available options are described in the "Options and Parameters" section below. See [here](https://www.mindspore.cn/tutorial/zh-CN/master/use/data_preparation/loading_the_datasets.html#mindspore) for more information about loading datasets.

- Run `run_standalone_train.sh` for non-distributed training of the Transformer model.

```bash
sh scripts/run_standalone_train.sh DEVICE_ID EPOCH_SIZE DATA_PATH
```
- Run `run_distribute_train.sh` for distributed training of the Transformer model.

```bash
sh scripts/run_distribute_train.sh DEVICE_NUM EPOCH_SIZE DATA_PATH MINDSPORE_HCCL_CONFIG_PATH
```

### Evaluation
- Set options in `eval_config.py`. Make sure `data_file`, `model_file`, and `output_file` are set to your own paths.

- Run `eval.py` to evaluate the Transformer model.

```bash
python eval.py
```

- Run `process_output.sh` to convert the output token IDs into the actual translation text.

```bash
sh scripts/process_output.sh REF_DATA EVAL_OUTPUT VOCAB_FILE
```
You will get two files, REF_DATA.forbleu and EVAL_OUTPUT.forbleu, for BLEU score calculation.

- To calculate the BLEU score, you may use this [perl script](https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl) and run the following command:

```bash
perl multi-bleu.perl REF_DATA.forbleu < EVAL_OUTPUT.forbleu
```

---

## Usage

### Training
```
usage: train.py [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
                [--enable_save_ckpt ENABLE_SAVE_CKPT]
                [--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
                [--enable_data_sink ENABLE_DATA_SINK] [--checkpoint_path CHECKPOINT_PATH]
                [--save_checkpoint_steps N] [--save_checkpoint_num N]
                [--save_checkpoint_path SAVE_CHECKPOINT_PATH] [--data_path DATA_PATH]

options:
    --distribute               train on several devices: "true" (training with more than one device) | "false", default is "false"
    --epoch_size               epoch size: N, default is 52
    --device_num               number of used devices: N, default is 1
    --device_id                device id: N, default is 0
    --enable_save_ckpt         enable checkpoint saving: "true" | "false", default is "true"
    --enable_lossscale         enable loss scaling: "true" | "false", default is "true"
    --do_shuffle               enable shuffle: "true" | "false", default is "true"
    --enable_data_sink         enable data sink: "true" | "false", default is "false"
    --checkpoint_path          path to load checkpoint files from: PATH, default is ""
    --save_checkpoint_steps    steps between checkpoint saves: N, default is 2500
    --save_checkpoint_num      maximum number of checkpoint files to keep: N, default is 30
    --save_checkpoint_path     path to save checkpoint files: PATH, default is "./checkpoint/"
    --data_path                path to the dataset file: PATH, default is ""
```

## Options and Parameters
Parameters of the Transformer model and options for training and evaluation are set in `config.py` and `eval_config.py`, respectively.
### Options:
```
config.py:
    transformer_network         version of the Transformer model: base | large, default is large
    init_loss_scale_value       initial value of the loss scale: N, default is 2^10
    scale_factor                factor used to update the loss scale: N, default is 2
    scale_window                number of steps between loss scale updates: N, default is 2000
    optimizer                   optimizer used in the network: Adam, default is "Adam"

eval_config.py:
    transformer_network         version of the Transformer model: base | large, default is large
    data_file                   data file: PATH
    model_file                  checkpoint file to be loaded: PATH
    output_file                 output file of evaluation: PATH
```
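
The exact contents of `config.py` and `eval_config.py` depend on the repository version, so the following is only an illustrative sketch of the kind of settings listed above. The option names mirror the list; the dictionary layout, the `TRAIN_OPTIONS`/`EVAL_OPTIONS` names, and all path values are hypothetical placeholders:

```python
# Illustrative sketch only: the real config.py and eval_config.py define their own
# structures. Option names follow the list above; values and layout are placeholders.
TRAIN_OPTIONS = {
    "transformer_network": "large",   # "base" | "large"
    "init_loss_scale_value": 1024,    # 2^10, starting loss scale for mixed precision
    "scale_factor": 2,                # factor applied when the loss scale is updated
    "scale_window": 2000,             # steps without overflow before the loss scale grows
    "optimizer": "Adam",
}

EVAL_OPTIONS = {
    "transformer_network": "large",
    "data_file": "/path/newstest2014-l128-mindrecord",  # produced by create_data.py above
    "model_file": "/path/to/checkpoint.ckpt",           # checkpoint saved during training
    "output_file": "/path/to/eval_output",              # token ids consumed by process_output.sh
}
```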

### Parameters:
```
Parameters for dataset and network (Training/Evaluation):
    batch_size                      batch size of input dataset: N, default is 96
    seq_length                      length of input sequence: N, default is 128
    vocab_size                      size of the vocabulary: N, default is 36560
    hidden_size                     size of Transformer encoder layers: N, default is 1024
    num_hidden_layers               number of hidden layers: N, default is 6
    num_attention_heads             number of attention heads: N, default is 16
    intermediate_size               size of the intermediate (feed-forward) layer: N, default is 4096
    hidden_act                      activation function used: ACTIVATION, default is "relu"
    hidden_dropout_prob             dropout probability for TransformerOutput: Q, default is 0.3
    attention_probs_dropout_prob    dropout probability for TransformerAttention: Q, default is 0.3
    max_position_embeddings         maximum length of sequences: N, default is 128
    initializer_range               initialization value of TruncatedNormal: Q, default is 0.02
    label_smoothing                 label smoothing setting: Q, default is 0.1
    input_mask_from_dataset         use the input mask loaded from the dataset: True | False, default is True
    beam_width                      beam width setting: N, default is 4
    max_decode_length               max decode length in evaluation: N, default is 80
    length_penalty_weight           normalize scores of translations according to their length: Q, default is 1.0
    compute_type                    compute type in Transformer: mstype.float16 | mstype.float32, default is mstype.float16

Parameters for learning rate:
    learning_rate                   value of learning rate: Q
    warmup_steps                    steps of the learning rate warm up: N
    start_decay_step                step at which the learning rate starts to decay: N
    min_lr                          minimal learning rate: Q
```
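
The four learning-rate parameters describe a warmup-then-decay schedule: the rate ramps up over `warmup_steps`, decay begins at `start_decay_step`, and the rate never drops below `min_lr`. The exact formula is defined in `src/lr_schedule.py`; the sketch below is only an assumed illustration of how such a schedule could be shaped, with placeholder values, not the project's actual implementation:

```python
"""Illustrative learning-rate schedule (assumed shape, not the formula in src/lr_schedule.py)."""

def illustrative_lr(step, learning_rate, warmup_steps, start_decay_step, min_lr, decay_rate=0.99):
    """Linear warmup, constant plateau, then exponential decay floored at min_lr."""
    if step < warmup_steps:
        # Ramp up linearly from 0 to the base learning rate.
        return learning_rate * (step + 1) / warmup_steps
    if step < start_decay_step:
        # Hold the base rate until decay begins.
        return learning_rate
    # Decay after start_decay_step, never going below min_lr.
    decayed = learning_rate * (decay_rate ** (step - start_decay_step))
    return max(decayed, min_lr)


if __name__ == "__main__":
    # Placeholder values, chosen only to show the shape of the curve.
    for step in (0, 1000, 3000, 8000, 20000):
        print(step, illustrative_lr(step, learning_rate=1e-3, warmup_steps=4000,
                                    start_decay_step=8000, min_lr=1e-5))
```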