The BERT network was proposed by Google in 2018 and marked a breakthrough in the field of NLP. A large network is pre-trained once, and multiple text-based tasks can then be handled in fine-tuning by adding only an output layer, without modifying the network structure. The backbone of BERT adopts the Encoder structure of Transformer, whose attention mechanism enables the output layer to capture high-dimensional global semantic information. Pre-training uses denoising and self-encoding tasks, namely MLM (Masked Language Model) and NSP (Next Sentence Prediction). Because these tasks need no labeled data, pre-training can be performed on massive text corpora, and only a small amount of data is needed to fine-tune downstream tasks and obtain good results. The pre-training plus fine-tuning mode created by BERT has been widely adopted by subsequent NLP networks.
Paper: Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv preprint arXiv:1810.04805.
Paper: Junqiu Wei, Xiaozhe Ren, Xiaoguang Li, Wenyong Huang, Yi Liao, Yasheng Wang, Jiashu Lin, Xin Jiang, Xiao Chen, Qun Liu. NEZHA: Neural Contextualized Representation for Chinese Language Understanding. arXiv preprint arXiv:1909.00204.
The backbone structure of BERT is the Transformer. For BERT_base, the Transformer contains 12 encoder modules; each encoder module contains one self-attention module, and each self-attention module contains one attention module. For BERT_NEZHA, the Transformer contains 24 encoder modules with the same internal structure. The two also differ in position encoding: BERT_base uses absolute position encoding to produce the position embedding vector, while BERT_NEZHA uses relative position encoding.
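To make the difference between the two position schemes concrete, here is a minimal pure-Python sketch. Absolute encoding looks up one vector per token index, while relative encoding looks up one vector per clipped distance between a query position and a key position. The function names and the clipping distance `max_distance` are illustrative assumptions and are not taken from this repository.

```python
def absolute_position_ids(seq_length):
    """One id per token position: 0, 1, ..., seq_length - 1."""
    return list(range(seq_length))


def relative_position_ids(seq_length, max_distance):
    """Matrix of clipped (key - query) distances, shifted to be non-negative."""
    ids = []
    for query in range(seq_length):
        row = []
        for key in range(seq_length):
            # Clip the signed distance to [-max_distance, max_distance].
            distance = max(-max_distance, min(max_distance, key - query))
            # Shift so the result can be used as an embedding-table index.
            row.append(distance + max_distance)
        ids.append(row)
    return ids


abs_ids = absolute_position_ids(4)
rel_ids = relative_position_ids(4, max_distance=2)
print(abs_ids)
for row in rel_ids:
    print(row)
```

With absolute ids the embedding depends only on where a token sits; with relative ids every (query, key) pair shares an embedding with all other pairs at the same distance.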
After installing MindSpore via the official website, you can start pre-training, fine-tuning and evaluation as follows:
# run standalone pre-training example
bash scripts/run_standalone_pretrain_ascend.sh 0 1 /path/cn-wiki-128
# run distributed pre-training example
bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.json
# run fine-tuning and evaluation example
- If you are going to run a fine-tuning task, please prepare a checkpoint generated from pre-training.
- Set bert network config and optimizer hyperparameters in `finetune_eval_config.py`.
- Classification task: Set task related hyperparameters in scripts/run_classifier.sh.
- Run `bash scripts/run_classifier.sh` for fine-tuning of BERT-base and BERT-NEZHA model.
bash scripts/run_classifier.sh
- NER task: Set task related hyperparameters in scripts/run_ner.sh.
- Run `bash scripts/run_ner.sh` for fine-tuning of BERT-base and BERT-NEZHA model.
bash scripts/run_ner.sh
- SQuAD task: Set task related hyperparameters in scripts/run_squad.sh.
- Run `bash scripts/run_squad.sh` for fine-tuning of BERT-base and BERT-NEZHA model.
bash scripts/run_squad.sh
For distributed training, an hccl configuration file with JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools
For the dataset, if you want to set the format and parameters, you need to create a schema configuration file in JSON format; please refer to the tfrecord format.
For pretraining, the schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
For the ner or classification task, the schema file contains ["input_ids", "input_mask", "segment_ids", "label_ids"].
For the squad task, the training schema file contains ["start_positions", "end_positions", "input_ids", "input_mask", "segment_ids"], and the evaluation schema file contains ["input_ids", "input_mask", "segment_ids"].
`numRows` is the only option that the user can set freely; all other values must match the dataset.
For example, the schema file of the cn-wiki-128 dataset for pretraining is as follows:
{
"datasetType": "TF",
"numRows": 7680,
"columns": {
"input_ids": {
"type": "int64",
"rank": 1,
"shape": [128]
},
"input_mask": {
"type": "int64",
"rank": 1,
"shape": [128]
},
"segment_ids": {
"type": "int64",
"rank": 1,
"shape": [128]
},
"next_sentence_labels": {
"type": "int64",
"rank": 1,
"shape": [1]
},
"masked_lm_positions": {
"type": "int64",
"rank": 1,
"shape": [20]
},
"masked_lm_ids": {
"type": "int64",
"rank": 1,
"shape": [20]
},
"masked_lm_weights": {
"type": "float32",
"rank": 1,
"shape": [20]
}
}
}
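A schema like the one above can also be generated rather than written by hand. The following is a minimal sketch using only the Python standard library; the helper name `build_pretrain_schema` and its parameters `seq_length` and `max_predictions` are assumptions introduced for illustration, with defaults chosen to reproduce the cn-wiki-128 example.

```python
import json


def build_pretrain_schema(num_rows, seq_length=128, max_predictions=20):
    """Return a schema dict laid out like the cn-wiki-128 example."""
    def column(dtype, length):
        # Every column in the example is a rank-1 tensor of fixed length.
        return {"type": dtype, "rank": 1, "shape": [length]}

    return {
        "datasetType": "TF",
        "numRows": num_rows,
        "columns": {
            "input_ids": column("int64", seq_length),
            "input_mask": column("int64", seq_length),
            "segment_ids": column("int64", seq_length),
            "next_sentence_labels": column("int64", 1),
            "masked_lm_positions": column("int64", max_predictions),
            "masked_lm_ids": column("int64", max_predictions),
            "masked_lm_weights": column("float32", max_predictions),
        },
    }


schema = build_pretrain_schema(num_rows=7680)
schema_text = json.dumps(schema, indent=2)
print(schema_text)
```

Writing `schema_text` to a file gives a schema whose shapes stay consistent if the sequence length or the number of masked positions changes.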
.
└─bert
  ├─README.md
  ├─scripts
    ├─ascend_distributed_launcher
        ├─__init__.py
        ├─hyper_parameter_config.ini          # hyper parameters for distributed pretraining
        ├─run_distribute_pretrain.py          # script for distributed pretraining
        ├─README.md
    ├─run_classifier.sh                       # shell script for standalone classifier task on ascend or gpu
    ├─run_ner.sh                              # shell script for standalone NER task on ascend or gpu
    ├─run_squad.sh                            # shell script for standalone SQUAD task on ascend or gpu
    ├─run_standalone_pretrain_ascend.sh       # shell script for standalone pretrain on ascend
    ├─run_distributed_pretrain_ascend.sh      # shell script for distributed pretrain on ascend
    ├─run_distributed_pretrain_gpu.sh         # shell script for distributed pretrain on gpu
    └─run_standalone_pretrain_gpu.sh          # shell script for standalone pretrain on gpu
  ├─src
    ├─__init__.py
    ├─assessment_method.py                    # assessment method for evaluation
    ├─bert_for_finetune.py                    # backbone code of network
    ├─bert_for_pre_training.py                # backbone code of network
    ├─bert_model.py                           # backbone code of network
    ├─clue_classification_dataset_precess.py  # data preprocessing
    ├─cluner_evaluation.py                    # evaluation for cluner
    ├─config.py                               # parameter configuration for pretraining
    ├─CRF.py                                  # assessment method for clue dataset
    ├─dataset.py                              # data preprocessing
    ├─finetune_eval_config.py                 # parameter configuration for finetuning
    ├─finetune_eval_model.py                  # backbone code of network
    ├─sample_process.py                       # sample processing
    ├─utils.py                                # util function
  ├─pretrain_eval.py                          # train and eval net
  ├─run_classifier.py                         # finetune and eval net for classifier task
  ├─run_ner.py                                # finetune and eval net for ner task
  ├─run_pretrain.py                           # train net for pretraining phase
  └─run_squad.py                              # finetune and eval net for squad task
usage: run_pretrain.py [--distribute DISTRIBUTE] [--epoch_size N] [--device_num N] [--device_id N]
[--enable_save_ckpt ENABLE_SAVE_CKPT] [--device_target DEVICE_TARGET]
[--enable_lossscale ENABLE_LOSSSCALE] [--do_shuffle DO_SHUFFLE]
[--enable_data_sink ENABLE_DATA_SINK] [--data_sink_steps N]
[--accumulation_steps N]
[--save_checkpoint_path SAVE_CHECKPOINT_PATH]
[--load_checkpoint_path LOAD_CHECKPOINT_PATH]
[--save_checkpoint_steps N] [--save_checkpoint_num N]
[--data_dir DATA_DIR] [--schema_dir SCHEMA_DIR] [--train_steps N]
options:
--device_target device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
--distribute pre-training on several devices: "true"(training by more than 1 device) | "false", default is "false"
--epoch_size epoch size: N, default is 1
--device_num number of used devices: N, default is 1
--device_id device id: N, default is 0
--enable_save_ckpt enable save checkpoint: "true" | "false", default is "true"
--enable_lossscale enable lossscale: "true" | "false", default is "true"
--do_shuffle enable shuffle: "true" | "false", default is "true"
--enable_data_sink enable data sink: "true" | "false", default is "true"
--data_sink_steps set data sink steps: N, default is 1
--accumulation_steps accumulate gradients N times before weight update: N, default is 1
--save_checkpoint_path path to save checkpoint files: PATH, default is ""
--load_checkpoint_path path to load checkpoint files: PATH, default is ""
--save_checkpoint_steps steps for saving checkpoint files: N, default is 1000
--save_checkpoint_num number for saving checkpoint files: N, default is 1
--train_steps Training Steps: N, default is -1
--data_dir path to dataset directory: PATH, default is ""
--schema_dir path to schema.json file, PATH, default is ""
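As an illustration of how these options fit together, the sketch below assembles a command line for `run_pretrain.py` from a few of the flags listed above. It only builds and prints the command and does not launch training. The helper name `build_pretrain_command`, the `python` interpreter name, and the `--flag=value` spelling are assumptions made for this example.

```python
import shlex


def build_pretrain_command(data_dir, schema_dir="", device_id=0,
                           epoch_size=1, distribute="false",
                           device_target="Ascend"):
    """Assemble an argument list for run_pretrain.py without running it."""
    return [
        "python", "run_pretrain.py",
        f"--device_target={device_target}",
        f"--distribute={distribute}",
        f"--epoch_size={epoch_size}",
        f"--device_id={device_id}",
        f"--data_dir={data_dir}",
        f"--schema_dir={schema_dir}",
    ]


command = build_pretrain_command("/path/cn-wiki-128", epoch_size=40)
# Quote each part so the printed line can be pasted into a shell.
print(" ".join(shlex.quote(part) for part in command))
```

Keeping the command as a list makes it straightforward to pass to `subprocess.run` later without shell quoting issues.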
usage: run_ner.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
[--assessment_method ASSESSMENT_METHOD] [--use_crf USE_CRF]
[--device_id N] [--epoch_num N] [--vocab_file_path VOCAB_FILE_PATH]
[--label2id_file_path LABEL2ID_FILE_PATH]
[--train_data_shuffle TRAIN_DATA_SHUFFLE]
[--eval_data_shuffle EVAL_DATA_SHUFFLE]
[--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
[--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH]
[--train_data_file_path TRAIN_DATA_FILE_PATH]
[--eval_data_file_path EVAL_DATA_FILE_PATH]
[--schema_file_path SCHEMA_FILE_PATH]
options:
--device_target device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
--do_train whether to run training on training set: true | false
--do_eval whether to run eval on dev set: true | false
--assessment_method assessment method to do evaluation: f1 | clue_benchmark
--use_crf whether to use crf to calculate loss: true | false
--device_id device id to run task
--epoch_num total number of training epochs to perform
--num_class number of classes to do labeling
--train_data_shuffle Enable train data shuffle, default is true
--eval_data_shuffle Enable eval data shuffle, default is true
--vocab_file_path the vocabulary file that the BERT model was trained on
--label2id_file_path label to id json file
--save_finetune_checkpoint_path path to save generated finetuning checkpoint
--load_pretrain_checkpoint_path initial checkpoint (usually from a pre-trained BERT model)
--load_finetune_checkpoint_path give a finetuning checkpoint path if only do eval
--train_data_file_path ner tfrecord for training. E.g., train.tfrecord
--eval_data_file_path ner tfrecord for predictions if f1 is used to evaluate result, ner json for predictions if clue_benchmark is used to evaluate result
--schema_file_path path to datafile schema file
usage: run_squad.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
[--device_id N] [--epoch_num N] [--num_class N]
[--vocab_file_path VOCAB_FILE_PATH]
[--eval_json_path EVAL_JSON_PATH]
[--train_data_shuffle TRAIN_DATA_SHUFFLE]
[--eval_data_shuffle EVAL_DATA_SHUFFLE]
[--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
[--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH]
[--load_finetune_checkpoint_path LOAD_FINETUNE_CHECKPOINT_PATH]
[--train_data_file_path TRAIN_DATA_FILE_PATH]
[--eval_data_file_path EVAL_DATA_FILE_PATH]
[--schema_file_path SCHEMA_FILE_PATH]
options:
--device_target device where the code will be implemented: "Ascend" | "GPU", default is "Ascend"
--do_train whether to run training on training set: true | false
--do_eval whether to run eval on dev set: true | false
--device_id device id to run task
--epoch_num total number of training epochs to perform
--num_class number of classes to classify, usually 2 for squad task
--train_data_shuffle Enable train data shuffle, default is true
--eval_data_shuffle Enable eval data shuffle, default is true
--vocab_file_path the vocabulary file that the BERT model was trained on
--eval_json_path path to squad dev json file
--save_finetune_checkpoint_path path to save generated finetuning checkpoint
--load_pretrain_checkpoint_path initial checkpoint (usually from a pre-trained BERT model)
--load_finetune_checkpoint_path give a finetuning checkpoint path if only do eval
--train_data_file_path squad tfrecord for training. E.g., train1.1.tfrecord
--eval_data_file_path squad tfrecord for predictions. E.g., dev1.1.tfrecord
--schema_file_path path to datafile schema file
usage: run_classifier.py [--device_target DEVICE_TARGET] [--do_train DO_TRAIN] [--do_eval DO_EVAL]
[--assessment_method ASSESSMENT_METHOD] [--device_id N] [--epoch_num N] [--num_class N]
[--save_finetune_checkpoint_path SAVE_FINETUNE_CHECKPOINT_PATH]
[--load_pretrain_checkpoint_path LOAD_PRETRAIN_CHECKPOINT_PATH]
[--load_finetune_checkpoint_path LOAD_FINETUNE_CHECKPOINT_PATH]
[--train_data_shuffle TRAIN_DATA_SHUFFLE]
[--eval_data_shuffle EVAL_DATA_SHUFFLE]
[--train_data_file_path TRAIN_DATA_FILE_PATH]
[--eval_data_file_path EVAL_DATA_FILE_PATH]
[--schema_file_path SCHEMA_FILE_PATH]
options:
--device_target targeted device to run task: Ascend | GPU
--do_train whether to run training on training set: true | false
--do_eval whether to run eval on dev set: true | false
--assessment_method assessment method to do evaluation: accuracy | f1 | mcc | spearman_correlation
--device_id device id to run task
--epoch_num total number of training epochs to perform
--num_class number of classes to do labeling
--train_data_shuffle Enable train data shuffle, default is true
--eval_data_shuffle Enable eval data shuffle, default is true
--save_finetune_checkpoint_path path to save generated finetuning checkpoint
--load_pretrain_checkpoint_path initial checkpoint (usually from a pre-trained BERT model)
--load_finetune_checkpoint_path give a finetuning checkpoint path if only do eval
--train_data_file_path tfrecord for training. E.g., train.tfrecord
--eval_data_file_path tfrecord for predictions. E.g., dev.tfrecord
--schema_file_path path to datafile schema file
Parameters for training and evaluation can be set in file config.py and finetune_eval_config.py respectively.
Config for loss scale, etc.:
bert_network version of BERT model: base | nezha, default is base
loss_scale_value initial value of loss scale: N, default is 2^32
scale_factor factor used to update loss scale: N, default is 2
scale_window number of steps between loss scale updates: N, default is 1000
optimizer optimizer used in the network: AdamWeightDecayDynamicLR | Lamb | Momentum, default is "Lamb"
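The three loss scale parameters interact as in the usual dynamic loss scaling rule: shrink the scale when gradients overflow, and grow it again after a window of clean steps. The class below is a minimal sketch of that general rule, written under the assumption that the framework follows it; the class name and the lower bound of 1.0 are illustrative, and the actual MindSpore implementation may differ in detail.

```python
class DynamicLossScale:
    """Sketch of dynamic loss scaling driven by overflow detection."""

    def __init__(self, loss_scale_value=2 ** 32, scale_factor=2, scale_window=1000):
        self.scale = float(loss_scale_value)
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, overflow):
        if overflow:
            # Overflow: shrink the scale and restart the window.
            self.scale = max(self.scale / self.scale_factor, 1.0)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps == self.scale_window:
                # A full window without overflow: grow the scale.
                self.scale *= self.scale_factor
                self.good_steps = 0
        return self.scale


# A short window is used here only to keep the demonstration small.
scaler = DynamicLossScale(loss_scale_value=65536, scale_factor=2, scale_window=3)
history = [scaler.update(overflow) for overflow in [True, False, False, False, False]]
print(history)
```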
Parameters for dataset and network (Pre-Training/Fine-Tuning/Evaluation):
batch_size batch size of input dataset: N, default is 16
seq_length length of input sequence: N, default is 128
vocab_size size of the vocabulary: N, must be consistent with the dataset you use. Default is 21136
hidden_size size of bert encoder layers: N, default is 768
num_hidden_layers number of hidden layers: N, default is 12
num_attention_heads number of attention heads: N, default is 12
intermediate_size size of intermediate layer: N, default is 3072
hidden_act activation function used: ACTIVATION, default is "gelu"
hidden_dropout_prob dropout probability for BertOutput: Q, default is 0.1
attention_probs_dropout_prob dropout probability for BertAttention: Q, default is 0.1
max_position_embeddings maximum length of sequences: N, default is 512
type_vocab_size size of token type vocab: N, default is 16
initializer_range initialization value of TruncatedNormal: Q, default is 0.02
use_relative_positions use relative positions or not: True | False, default is False
input_mask_from_dataset use the input mask loaded from dataset or not: True | False, default is True
token_type_ids_from_dataset use the token type ids loaded from dataset or not: True | False, default is True
dtype data type of input: mstype.float16 | mstype.float32, default is mstype.float32
compute_type compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float16
Parameters for optimizer:
AdamWeightDecay:
decay_steps steps of the learning rate decay: N
learning_rate value of learning rate: Q
end_learning_rate value of end learning rate: Q, must be positive
power power: Q
warmup_steps steps of the learning rate warm up: N
weight_decay weight decay: Q
eps term added to the denominator to improve numerical stability: Q
Lamb:
decay_steps steps of the learning rate decay: N
learning_rate value of learning rate: Q
end_learning_rate value of end learning rate: Q
power power: Q
warmup_steps steps of the learning rate warm up: N
weight_decay weight decay: Q
Momentum:
learning_rate value of learning rate: Q
momentum momentum for the moving average: Q
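The `learning_rate`, `end_learning_rate`, `decay_steps`, `power` and `warmup_steps` parameters describe a polynomial decay schedule with warm-up. The function below is a minimal sketch of such a schedule: a linear warm-up followed by polynomial decay towards `end_learning_rate`. How this repository combines the two phases is an assumption here, and the numeric values in the example are hypothetical rather than the defaults from `config.py`.

```python
def learning_rate_at(step, learning_rate, end_learning_rate,
                     decay_steps, power, warmup_steps):
    """Learning rate at a given step: linear warm-up, then polynomial decay."""
    if warmup_steps > 0 and step < warmup_steps:
        # Warm-up: ramp linearly from 0 up to the base learning rate.
        return learning_rate * step / warmup_steps
    # Decay: interpolate from learning_rate down to end_learning_rate.
    progress = min(step, decay_steps) / decay_steps
    return (learning_rate - end_learning_rate) * (1.0 - progress) ** power + end_learning_rate


# Hypothetical values chosen only to illustrate the shape of the curve.
params = dict(learning_rate=1e-4, end_learning_rate=1e-6,
              decay_steps=1000, power=1.0, warmup_steps=100)
schedule = [learning_rate_at(step, **params) for step in (0, 50, 100, 500, 1000, 2000)]
print(schedule)
```

With `power` equal to 1.0 the decay is linear; larger values make the rate fall faster early in training.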
bash scripts/run_standalone_pretrain_ascend.sh 0 1 /path/cn-wiki-128
The command above runs in the background; you can view training logs in pretraining_log.txt. After training finishes, checkpoint files are saved under the script folder by default. The loss values are displayed as follows:
# grep "epoch" pretraining_log.txt
epoch: 0.0, current epoch percent: 0.000, step: 1, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.0856101e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.000, step: 2, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.0821701e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
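To track the loss over time, the step number and loss value can be extracted from log lines of this shape. The sketch below is one possible approach and assumes the log format matches the sample lines shown above exactly; the pattern and the helper name `parse_losses` are illustrative and are not part of the repository.

```python
import re

# Captures the step number and the first tensor value (the loss).
LOSS_PATTERN = re.compile(
    r"step: (\d+), \w+ are \(Tensor\(shape=\[1\], dtype=Float32, \[\s*([0-9.eE+-]+)\]\)"
)


def parse_losses(lines):
    """Return a list of (step, loss) pairs found in the given log lines."""
    losses = []
    for line in lines:
        match = LOSS_PATTERN.search(line)
        if match:
            losses.append((int(match.group(1)), float(match.group(2))))
    return losses


sample = [
    "epoch: 0.0, current epoch percent: 0.000, step: 1, outpus are "
    "(Tensor(shape=[1], dtype=Float32, [ 1.0856101e+01]), "
    "Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))",
    "epoch: 0.0, current epoch percent: 0.000, step: 2, outpus are "
    "(Tensor(shape=[1], dtype=Float32, [ 1.0821701e+01]), "
    "Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))",
]
losses = parse_losses(sample)
print(losses)
```

The same function accepts an open file object, so the lines of pretraining_log.txt can be passed to it directly.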
bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.json
The command above runs in the background; you can view training logs in pretraining_log.txt. After training finishes, checkpoint files are saved under the LOG* folder by default. The loss values are displayed as follows:
# grep "epoch" LOG*/pretraining_log.txt
epoch: 0.0, current epoch percent: 0.001, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.08209e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.07566e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
epoch: 0.0, current epoch percent: 0.001, step: 100, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.08218e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
epoch: 0.0, current epoch percent: 0.002, step: 200, outpus are (Tensor(shape=[1], dtype=Float32, [ 1.07770e+01]), Tensor(shape=[], dtype=Bool, False), Tensor(shape=[], dtype=Float32, 65536))
...
Before running the command below, please check that the path of the pre-training checkpoint to load has been set. Use an absolute path, e.g. "/username/pretrain/checkpoint_100_300.ckpt".
bash scripts/run_classifier.sh
The command above runs in the background; you can view training logs in classfier_log.txt.
If you choose accuracy as assessment method, the result will be as follows:
acc_num XXX, total_num XXX, accuracy 0.588986
bash scripts/run_ner.sh
The command above runs in the background; you can view training logs in ner_log.txt.
If you choose F1 as assessment method, the result will be as follows:
Precision 0.920507
Recall 0.948683
F1 0.920507
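For reference, precision, recall and F1 are related as in the sketch below, which uses the standard definitions. The counts in the example are made up for illustration and are not results from this repository, and the helper name is an assumption; the repository's own computation lives in its assessment code and is not reproduced here.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard precision, recall and F1 from prediction counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical counts: 90 correct entities, 10 spurious, 30 missed.
precision, recall, f1 = precision_recall_f1(90, 10, 30)
print(f"Precision {precision:.6f}")
print(f"Recall {recall:.6f}")
print(f"F1 {f1:.6f}")
```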
bash scripts/run_squad.sh
The command above runs in the background; you can view training logs in squad_log.txt.
The result will be as follows:
{"exact_match": 80.3878923040233284, "f1": 87.6902384023850329}
| Parameters | Ascend | GPU |
|---|---|---|
| Model Version | BERT_base | BERT_base |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Epoch | 40 | |
| Batch_size | 256*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 340ms/step | 1.913 |
| Total time | 73h | |
| Params (M) | 110M | |
| Checkpoint for Fine tuning | 1.2G(.ckpt file) | |
| Parameters | Ascend | GPU |
|---|---|---|
| Model Version | BERT_NEZHA | BERT_NEZHA |
| Resource | Ascend 910, cpu:2.60GHz 56cores, memory:314G | NV SMX2 V100-32G |
| uploaded Date | 08/20/2020 | 05/06/2020 |
| MindSpore Version | 0.6.0 | 0.3.0 |
| Dataset | cn-wiki-128(4000w) | ImageNet |
| Training Parameters | src/config.py | src/config.py |
| Optimizer | Lamb | Momentum |
| Loss Function | SoftmaxCrossEntropy | SoftmaxCrossEntropy |
| outputs | probability | |
| Epoch | 40 | |
| Batch_size | 96*8 | 130(8P) |
| Loss | 1.7 | 1.913 |
| Speed | 360ms/step | 1.913 |
| Total time | 200h | |
| Params (M) | 340M | |
| Checkpoint for Fine tuning | 3.2G(.ckpt file) | |
| Parameters | Ascend | GPU |
|---|---|---|
| Model Version | | |
| Resource | Ascend 910 | NV SMX2 V100-32G |
| uploaded Date | 08/22/2020 | 05/22/2020 |
| MindSpore Version | 0.6.0 | 0.2.0 |
| Dataset | cola, 1.2W | ImageNet, 1.2W |
| batch_size | 32(1P) | 130(8P) |
| Accuracy | 0.588986 | ACC1[72.07%] ACC5[90.90%] |
| Speed | 59.25ms/step | |
| Total time | 15min | |
| Model for inference | 1.2G(.ckpt file) | |
In run_standalone_pretrain.sh and run_distributed_pretrain.sh, we set do_shuffle to True to shuffle the dataset by default.
In run_classifier.sh, run_ner.sh and run_squad.sh, we set train_data_shuffle and eval_data_shuffle to True to shuffle the dataset by default.
In config.py, we set hidden_dropout_prob and attention_probs_dropout_prob to 0.1 by default, so some network nodes are randomly dropped during training.
In run_pretrain.py, we set a random seed to make sure that each node has the same initial weights in distributed training.
Please check the official homepage.