The U-Net medical model performs two-dimensional image segmentation. The implementation follows the paper [U-Net: Convolutional Networks for Biomedical Image Segmentation](https://arxiv.org/abs/1505.04597). U-Net won several best awards in the 2015 ISBI cell tracking challenge. The paper proposes a network model and a data augmentation strategy for medical image segmentation that make effective use of annotated data, addressing the shortage of labeled data in the medical field. The U-shaped network structure is also used to extract context and localization information.
[论文](https://arxiv.org/abs/1505.04597): Olaf Ronneberger, Philipp Fischer, Thomas Brox. "U-Net: Convolutional Networks for Biomedical Image Segmentation." *conditionally accepted at MICCAI 2015*. 2015.
- Description: The training and test datasets are two 30-section serial-section transmission electron microscopy (ssTEM) datasets of the Drosophila first instar larva ventral nerve cord (VNC). The microcube measures approximately 2 x 2 x 1.5 microns with a resolution of 4x4x50 nm/pixel.
- License: You are free to use this dataset to generate or test non-commercial image segmentation software. If a scientific publication uses this dataset, it must cite TrakEM2 and the following publication: Cardona A, Saalfeld S, Preibisch S, Schmid B, Cheng A, Pulokas J, Tomancak P, Hartenstein V. 2010. An Integrated Micro- and Macroarchitectural Analysis of the Drosophila Brain by Computer-Assisted Serial Section Electron Microscopy. PLoS Biol 8(10): e1000502. doi:10.1371/journal.pbio.1000502.
# MASS: Masked Sequence to Sequence Pre-training for Language Generation Description
[MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf) was released by Microsoft in June 2019.
BERT (Devlin et al., 2018) has achieved state-of-the-art results in natural language understanding by pre-training the encoder part of the Transformer (Vaswani et al., 2017) on masked rich-resource text. Likewise, GPT (Radford et al., 2018) pre-trains the decoder part of the Transformer on rich-resource text with the encoder inputs masked. Both build a robust language model by pre-training on masked rich-resource text.
Inspired by BERT, GPT and other language models, Microsoft proposed [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf), which jointly pre-trains the encoder and the decoder for language generation tasks.
# Model Architecture
The MASS network is implemented with the Transformer architecture, which has multiple encoder layers and multiple decoder layers.
For pre-training, we use the Adam optimizer with loss scaling to obtain the pre-trained model.
During fine-tuning, we fine-tune this pre-trained model on different datasets according to the downstream task.
During testing, we use the fine-tuned model to predict results, and adopt a beam search algorithm to obtain the most likely predictions.
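Since beam search is mentioned above, here is a minimal, framework-agnostic sketch of length-normalized beam search; the toy scorer, vocabulary and beam width are illustrative assumptions rather than the project's actual decoding code:

```python
# Minimal length-normalized beam search sketch (illustrative only; the real
# decoder scores tokens with the fine-tuned Transformer, not this toy function).
def toy_log_probs(prefix, vocab):
    """Return a fake log-probability for each candidate next token."""
    return {tok: -(len(prefix) + i + 1.0) for i, tok in enumerate(vocab)}

def beam_search(vocab, beam_size=4, max_len=5, eos="<eos>"):
    beams = [([], 0.0)]  # each hypothesis: (token sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:  # finished hypotheses are kept as-is
                candidates.append((seq, score))
                continue
            for tok, lp in toy_log_probs(seq, vocab).items():
                candidates.append((seq + [tok], score + lp))
        # keep the best `beam_size` hypotheses, scored with length normalization
        candidates.sort(key=lambda c: c[1] / len(c[0]), reverse=True)
        beams = candidates[:beam_size]
    return beams[0][0]

print(beam_search(vocab=["the", "cat", "sat", "<eos>"]))
```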
# Dataset
Note that you can run the scripts with the datasets mentioned in the original paper or with other datasets widely used in this domain or for this network architecture. In the following sections, we introduce how to run the scripts using the datasets below.
Datasets used:

- Monolingual English data from the News Crawl dataset (WMT 2019) for pre-training.
- Gigaword Corpus (Graff et al., 2003) for text summarization.
- Cornell Movie Dialog Corpus (Danescu-Niculescu-Mizil & Lee, 2011) for conversational response generation.

Details about these datasets can be found in [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf).
# Features
MASS is designed to jointly pre-train the encoder and decoder to complete language generation tasks.
First, through its sequence to sequence framework, MASS only predicts the masked tokens, which forces the encoder to understand the meaning of the unmasked tokens and encourages the decoder to extract useful information from the encoder.
Second, by predicting consecutive tokens on the decoder side, the decoder builds better language modeling ability than when predicting discrete tokens.
Third, by further masking the decoder input tokens that are not masked on the encoder side, the decoder is encouraged to extract more useful information from the encoder side rather than relying on the rich information in the previous tokens.
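The following is a small, self-contained Python sketch of the masking scheme described above; the token list, mask symbol and span choice are illustrative assumptions, not the project's data pipeline:

```python
# Illustrative MASS-style masking (assumed token list and mask symbol).
MASK = "[MASK]"

def mass_mask(tokens, start, length):
    """Mask a contiguous fragment for the encoder and build decoder inputs/targets.

    Encoder input : fragment [start, start+length) replaced by MASK.
    Decoder target: the masked fragment itself.
    Decoder input : the fragment shifted right by one position, so tokens the
                    encoder still sees stay hidden from the decoder.
    """
    fragment = tokens[start:start + length]
    encoder_input = tokens[:start] + [MASK] * length + tokens[start + length:]
    decoder_input = [MASK] + fragment[:-1]   # shifted-right masked fragment
    decoder_target = fragment
    return encoder_input, decoder_input, decoder_target

tokens = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8"]
enc_in, dec_in, dec_tgt = mass_mask(tokens, start=2, length=3)
print(enc_in)   # ['x1', 'x2', '[MASK]', '[MASK]', '[MASK]', 'x6', 'x7', 'x8']
print(dec_in)   # ['[MASK]', 'x3', 'x4']
print(dec_tgt)  # ['x3', 'x4', 'x5']
```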
# Script description
The MASS scripts and code structure are as follows:
```
...
│ ├──bi_data_loader.py      // Dataset loader for fine-tuning or inference.
│ ├──mono_data_loader.py    // Dataset loader for pre-training.
│ ├──language_model
│ ...
│ ├──run_gpu.sh             // GPU train & evaluate model script.
│ ├──learn_subword.sh       // Learn BPE codes.
│ ├──stop_training.sh       // Stop training.
├── requirements.txt        // Requirements of third-party packages.
├── train.py                // Train API entry.
├── eval.py                 // Infer API entry.
├── tokenize_corpus.py      // Corpus tokenization.
...
```
## Data Preparation
The data preparation of a natural language processing task involves data cleaning, tokenization, encoding and vocabulary generation.
In our experiments, using [Byte Pair Encoding (BPE)](https://arxiv.org/abs/1508.07909) reduces the vocabulary size and effectively relieves the out-of-vocabulary (OOV) problem.
The vocabulary can be created with `src/utils/dictionary.py` from the text dictionary learned by BPE.
For more details about BPE, please refer to the [subword-nmt library](https://www.cnpython.com/pypi/subword-nmt) or the [paper](https://arxiv.org/abs/1508.07909).
In our experiments, the vocabulary was learned from 1.9M sentences of the News Crawl dataset, and its size is 45755.
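As a concrete illustration, the sketch below learns BPE codes and applies them using the subword-nmt Python API; the file names and merge count are placeholders, and the project's own `learn_subword.sh` and `src/utils/dictionary.py` remain the reference workflow:

```python
import codecs
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

# Learn BPE merge operations from a tokenized corpus (paths are placeholders).
with codecs.open("train.tok.txt", encoding="utf-8") as infile, \
     codecs.open("all.bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe(infile, outfile, num_symbols=32000, min_frequency=2)

# Apply the learned codes to segment text into subword units.
with codecs.open("all.bpe.codes", encoding="utf-8") as codes:
    bpe = BPE(codes)
print(bpe.process_line("researchers released a new pretraining method"))
```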
Here is a brief introduction to the data preparation scripts.
### Tokenization
Use `tokenize_corpus.py` to tokenize a corpus whose text files are in `.txt` format.
Major parameters in `tokenize_corpus.py`:
```bash
--corpus_folder: Corpus folder path; if multiple folders are provided, separate them with ','.
--output_folder: Output folder path.
--tokenizer: Tokenizer to be used, nltk or jieba; if nltk is not fully installed, use jieba instead.
```
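For reference, the snippet below shows what the two tokenizer back-ends do on raw text. This is a rough illustration only; the actual splitting and file handling are performed inside `tokenize_corpus.py`:

```python
# Rough illustration of the two tokenizer back-ends (not the script itself).
import nltk
import jieba

nltk.download("punkt", quiet=True)  # tokenizer models needed by word_tokenize

english = "MASS pre-trains a sequence to sequence model."
print(nltk.word_tokenize(english))   # ['MASS', 'pre-trains', 'a', ...]

chinese = "掩码序列到序列预训练"
print(list(jieba.cut(chinese)))      # coarse Chinese word segmentation
```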
## Configuration

The JSON file under the `config/` path is the template configuration file.
Almost all of the required options and arguments can be assigned conveniently, including the training platform, the dataset and model configurations, the optimizer arguments, etc. Optional features such as loss scaling and checkpointing are also available by setting the corresponding options.

For more detailed information about the attributes, refer to the file `config/config.py`.
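As a quick sanity check, the template can be loaded and inspected with the standard `json` module. The key names below (`dataset_config`, `checkpoint_path`, and their children) follow the nodes referenced later in this document and are assumptions about the template layout:

```python
import json

# Load the template configuration and inspect a few nodes (assumed layout).
with open("config/config.json", encoding="utf-8") as f:
    config = json.load(f)

print(config.get("dataset_config", {}).get("pre_train_dataset"))
print(config.get("checkpoint_path", {}).get("existed_ckpt"))
```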
## Training & Evaluation process
For training a model, the shell script `run_ascend.sh` or `run_gpu.sh` is all you need. In these scripts, the environment variables are set and the training script `train.py` under `mass` is executed.
You may start training on a single device or multiple devices by assigning the options and running the command in bash:
options:

-t, --task select task: CHAR, 't' for train and 'i' for inference.
The following commands show an example of training with 2 devices.
Ascend:

```ascend
sh run_ascend.sh --task t --device_num 2 --hccl_json /{path}/rank_table.json --config /{path}/config.json
```

Note: discontinuous device IDs are not supported in `run_ascend.sh` at present; the device IDs in `rank_table.json` must start from 0.
GPU:

```gpu
sh run_gpu.sh --task t --device_num 2 --config /{path}/config.json
```
If you use a single chip, the command is as follows:
Ascend:

```ascend
sh run_ascend.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
```
GPU:

```gpu
sh run_gpu.sh --task t --device_num 1 --device_id 0 --config /{path}/config.json
```
## Weights average
```python
python weights_average.py --input_files your_checkpoint_list --output_file model.npz
```
The `input_files` argument is a list of your checkpoint files. To use `model.npz` as the weights, add its path to `config.json` at `existed_ckpt`.
```json
{
  ...
  "checkpoint_path": {
    ...
    "existed_ckpt": "/{path}/model.npz",
    ...
  },
  ...
}
```
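Conceptually, weights averaging takes the element-wise mean of each parameter across several checkpoints. Below is a minimal NumPy sketch of the idea; it is not the project's `weights_average.py`, and the checkpoint file names are placeholders:

```python
import numpy as np

def average_checkpoints(paths, output_path):
    """Average parameters with the same name across .npz checkpoints (illustrative)."""
    checkpoints = [np.load(p) for p in paths]
    names = checkpoints[0].files
    averaged = {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }
    np.savez(output_path, **averaged)

# Example usage with placeholder checkpoint paths.
average_checkpoints(["ckpt_epoch_18.npz", "ckpt_epoch_19.npz", "ckpt_epoch_20.npz"], "model.npz")
```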
## Learning rate scheduler
Two learning rate schedulers are provided in our model:

1. Polynomial decay scheduler.
2. Inverse square root scheduler.
The LR scheduler can be configured in `config/config.json`.
For the polynomial decay scheduler, the config could look like:
```json
{
  ...
}
```
For the inverse square root scheduler, the config could look like:
```json
{
  ...
}
```
More details about the LR schedulers can be found in `src/utils/lr_scheduler.py`.
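For intuition, the two schedules can be written as small Python functions. This is a conceptual sketch with assumed argument names; `src/utils/lr_scheduler.py` remains the authoritative implementation:

```python
# Conceptual sketch of the two schedules (argument names are assumptions).
def polynomial_decay_lr(step, base_lr, end_lr, decay_steps, power=1.0):
    """Decay from base_lr to end_lr over decay_steps following a polynomial curve."""
    step = min(step, decay_steps)
    frac = 1.0 - step / decay_steps
    return (base_lr - end_lr) * (frac ** power) + end_lr

def inverse_square_root_lr(step, base_lr, warmup_steps):
    """Linear warmup for warmup_steps, then decay proportionally to 1/sqrt(step)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr * (warmup_steps ** 0.5) / (step ** 0.5)

for s in (100, 1000, 10000):
    print(s, polynomial_decay_lr(s, 1e-4, 1e-6, 10000), inverse_square_root_lr(s, 1e-4, 2000))
```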
# Environment Requirements
## Platform
- Hardware (Ascend/GPU)
    - Prepare the hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
MASS pre-trains a sequence to sequence model by predicting masked fragments of the input sequence. Afterwards, downstream tasks including text summarization and conversational response generation are candidates for fine-tuning the model and for inference.
Here we provide a practical example to demonstrate the basic usage of MASS for pre-training, fine-tuning, and inference. The overall process is as follows:
1. Download and process the dataset.
2. Modify `config.json` to configure the network.
3. Run a task for pre-training and fine-tuning.
4. Perform inference and validation.
## Pre-training
To pre-train a model, first configure the options in `config.json`:

- Assign the `pre_train_dataset` under the `dataset_config` node to the dataset path.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
- Assign the `ckpt_prefix` and `ckpt_path` under the `checkpoint_path` node to save the model files.
Then run `run_ascend.sh` or `run_gpu.sh` to start the task. For example, on GPU: `sh run_gpu.sh -t t -n 1 -i 1 -c /mass/config/config.json`
The log and output files are generated under the path `./train_mass_*/`, and the model file is saved under the path assigned in `config/config.json`.
## Fine-tuning
To fine-tune a model, first configure the options in `config.json`:

- Assign the `fine_tune_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_path` node to the model file generated by pre-training.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
Then run `run_ascend.sh` or `run_gpu.sh` to start the task. For example, on GPU: `sh run_gpu.sh -t t -n 1 -i 1 -c config/config.json`
The log and output files are generated under the path `./train_mass_*/`, and the model file is saved under the path assigned in `config/config.json`.
## Inference
If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, you can refer to this [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/migrate_3rd_scripts.html).
For inference, first configure the options in `config.json`:

- Assign the `test_dataset` under the `dataset_config` node to the dataset path.
- Assign the `existed_ckpt` under the `checkpoint_path` node to the model file produced by fine-tuning.
- Choose the optimizer ('momentum', 'adam' and 'lamb' are available).
Then run `run_ascend.sh` or `run_gpu.sh` to start the task. For example, on GPU: `sh run_gpu.sh -t i -n 1 -i 1 -c config/config.json -o {outputfile}`
## Results
### Fine-Tuning on Text Summarization
The comparisons between MASS and two other pre-training methods in terms of ROUGE score on the text summarization task
with 3.8M training data are as follows:
| Method | RG-1(F) | RG-2(F) | RG-L(F) |
|:-------|:--------|:--------|:--------|
| MASS   | Ongoing | Ongoing | Ongoing |
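RG-1(F), RG-2(F) and RG-L(F) denote ROUGE F-scores. As a reference point for what the metric measures, here is a simplified unigram ROUGE-1 F computation; it ignores stemming, stopword handling and the other details of the official scorer:

```python
from collections import Counter

def rouge_1_f(candidate_tokens, reference_tokens):
    """Unigram overlap F-score (simplified ROUGE-1, no stemming or stopword handling)."""
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)

print(rouge_1_f("police kill the gunman".split(), "the gunman was killed by police".split()))
```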
### Fine-Tuning on Conversational Response Generation
The comparisons between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus are as follows:

| Method | Data = 10K | Data = 110K |
|:-------|:-----------|:------------|
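PPL denotes perplexity, the exponential of the average negative log-likelihood per token. A minimal illustration with made-up per-token log-probabilities:

```python
import math

def perplexity(token_log_probs):
    """exp of the average negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

# Example with made-up per-token log-probabilities.
print(perplexity([-2.1, -0.7, -1.3, -3.0]))
```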
### Training Performance

| Parameters | MASS |
|:-----------|:-----|
| Speed | 611.45 sentences/s |
| Total time | --/-- |
| Params (M) | 44.6M |
| Checkpoint for Fine tuning | ---Mb, --, [A link]() |
| Model for inference | ---Mb, --, [A link]() |
| Scripts | [A link]() |
### Inference Performance
| Parameters | MASS |
|:-----------|:-----|
| Accuracy | ppl=23.52 for conversational response, RG-1=29.79 for text summarization |
| Speed | ---- sentences/s |
| Total time | --/-- |
| Model for inference | ---Mb, --, [A link]() |
# Description of random situation
The MASS model contains dropout operations. If you want to disable dropout, set the related `dropout_rate` to 0 in `config/config.json`.
# Others
The model has been validated on Ascend and GPU environments, and has not been validated on CPU.
[Paper](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf): Song, Kaitao, Xu Tan, Tao Qin, Jianfeng Lu and Tie-Yan Liu. "MASS: Masked Sequence to Sequence Pre-training for Language Generation." ICML (2019).