
Add GPT

tags/v1.1.0
alouhahaha 5 years ago
parent
commit
837f1a160c
11 changed files with 1416 additions and 0 deletions
  1. +76
    -0
      model_zoo/official/nlp/gpt/README.md
  2. +155
    -0
      model_zoo/official/nlp/gpt/eval.py
  3. +38
    -0
      model_zoo/official/nlp/gpt/scripts/run_distribute_train.sh
  4. +33
    -0
      model_zoo/official/nlp/gpt/scripts/run_evaluation.sh
  5. +33
    -0
      model_zoo/official/nlp/gpt/scripts/run_standalone_train.sh
  6. +48
    -0
      model_zoo/official/nlp/gpt/src/dataset.py
  7. +545
    -0
      model_zoo/official/nlp/gpt/src/gpt.py
  8. +157
    -0
      model_zoo/official/nlp/gpt/src/gpt_wrapcell.py
  9. +60
    -0
      model_zoo/official/nlp/gpt/src/inference.py
  10. +138
    -0
      model_zoo/official/nlp/gpt/src/utils.py
  11. +133
    -0
      model_zoo/official/nlp/gpt/train.py

+ 76
- 0
model_zoo/official/nlp/gpt/README.md

@@ -0,0 +1,76 @@
# It is still under development.
# Contents
- [Contents](#contents)
- [GPT Description](#gpt-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
- [Script and Sample Code](#script-and-sample-code)
- [ModelZoo Homepage](#modelzoo-homepage)

# [GPT Description](#contents)
The GPT network was proposed by OpenAI and has three versions, i.e., GPT, GPT2 and GPT3. The newest version, GPT3, was proposed in May 2020 and is a very large language model with 175 billion parameters. By stacking many Transformer decoder layers and feeding in a massive amount of training data, GPT3 becomes such a powerful language model that no fine-tuning process is needed. As the paper title says, language models are few-shot learners: GPT3 shows that with a large, well-trained model, we can achieve performance similar to that of fine-tuning methods.


[Paper](https://arxiv.org/abs/2005.14165): Tom B. Brown, Benjamin Mann, Nick Ryder et al. [Language Models are Few-Shot Learners](https://arxiv.org/abs/2005.14165). arXiv preprint arXiv:2005.14165


# [Model Architecture](#contents)
GPT3 stacks many Transformer decoder layers. Depending on the number of layers and the embedding size, GPT3 comes in several versions. The largest model contains 96 layers with an embedding size of 12288, resulting in a total of 175 billion parameters.
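
A rough back-of-the-envelope estimate of the parameter count (a sketch that ignores biases, layer norms and positional embeddings) already lands close to the quoted 175 billion:

```python
# Each decoder layer holds ~4*d^2 attention weights (Q, K, V, output projection)
# plus ~8*d^2 feed-forward weights (4x expansion), i.e. ~12*d^2 in total.
num_layers, hidden, vocab = 96, 12288, 50257
per_layer = 12 * hidden ** 2             # attention + feed-forward weight matrices
embeddings = vocab * hidden              # token embedding table (tied with the output head)
total = num_layers * per_layer + embeddings
print(f"{total / 1e9:.1f}B parameters")  # ~174.6B, i.e. roughly 175 billion
```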

# [Dataset](#contents)
- OpenWebText is utilized as the training data and the training objective is to predict the next token at each position.

# [Environment Requirements](#contents)
- Hardware(Ascend)
- Prepare hardware environment with Ascend processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
- [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
- [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
- [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Quick Start](#contents)
After installing MindSpore via the official website, you can start training and evaluation as follows:
```bash
# run standalone training example
bash scripts/run_standalone_train.sh 0 10 /path/dataset

# run distributed training example
bash scripts/run_distribute_train.sh /path/dataset /path/hccl.json 8

# run evaluation example; currently only accuracy and perplexity on lambada and wikitext103 are supported
bash scripts/run_evaluation.sh lambada /your/ckpt /your/data acc
```

For distributed training, an hccl configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```shell
.
└─gpt
  ├─README.md
  ├─scripts
  │ ├─run_standalone_train.sh   # shell script for standalone training on Ascend
  │ ├─run_distribute_train.sh   # shell script for distributed training on Ascend
  │ └─run_evaluation.sh         # shell script for evaluation on Ascend
  ├─src
  │ ├─gpt_wrapcell.py           # training wrapper with loss scaling and gradient clipping
  │ ├─gpt.py                    # backbone code of the network
  │ ├─dataset.py                # data preprocessing
  │ ├─inference.py              # top-k text generation for evaluation
  │ └─utils.py                  # util functions
  ├─train.py                    # train net for training phase
  └─eval.py                     # eval net for evaluation
```

# [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).

+ 155
- 0
model_zoo/official/nlp/gpt/eval.py

@@ -0,0 +1,155 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
GPT evaluation script.
"""

import math
import argparse
import numpy as np
from mindspore import context
import mindspore.common.dtype as mstype
from mindspore.common.tensor import Tensor
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from src.inference import generate
from src.dataset import create_dataset
from src.gpt import GPT, EvalNet, GPTWithLoss, CrossEntropyLoss
from src.utils import GPTConfig

context.set_context(mode=context.GRAPH_MODE)

def ppl_score(probs, length, is_logsoftmax=True):
""" calculate perplexity with prob or log_prob inputs """
probs = probs[:length]
if is_logsoftmax:
prob = np.sum(probs) / length
ppl = 1.0 / np.power(np.e, prob)
else:
prob = 1.0
for p in probs:
prob *= (1.0 / p)
ppl = np.power(prob, 1.0/length)
return ppl

def get_ppl(model, dataset):
""" calculate perplexity for input dataset """
PPL = []
tokens = 0
for data in dataset:
data = data[0].asnumpy()
input_ids = data

logits = model(Tensor(input_ids, mstype.int32)).asnumpy()
PPL.append(logits * len(data))
tokens += len(data)

val_loss = sum(PPL) / tokens
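# `logits` collected above are the batch losses returned by GPTWithLoss; perplexity = exp(mean loss), capped at 20 to avoid overflow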
ppl = math.exp(min(20, val_loss))
return ppl

def get_acc(model, dataset):
""" calculate accuracy for input dataset """
total_num = 0
acc_num = 0
for data in dataset:
data = data[0].asnumpy()
input_mask = (data != 0).astype(np.int32)
length = np.sum(input_mask, 1)
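# LAMBADA-style last-word accuracy: the last valid token of each sample becomes the label and is zeroed out of the input before the forward pass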
label = np.zeros(length.shape)
for i, idx in enumerate(length):
label[i] = data[i][idx-1]
input_mask[i][idx-1] = 0
data[i][idx-1] = 0

length = np.sum(data != 50256, 1)
input_ids = data
logits = model(Tensor(input_ids, mstype.int32)).asnumpy()
logits = logits.reshape(len(length), -1)

predicted_label = np.zeros(length.shape)
for i, idx in enumerate(length):
predicted_label[i] = logits[i][idx-2]

total_num += len(label)
acc_num += sum(label == predicted_label)

acc = acc_num / total_num
return acc


def run_eval():
""" evaluate scripts """
parser = argparse.ArgumentParser(description="GPT inferencing")
parser.add_argument('--task_type', type=str, default="", help="Evaluation task.")
parser.add_argument('--metrics', type=str, default="acc", choices=["ppl", "acc"], help="Evaluation metrics.")
parser.add_argument('--ckpt_path', type=str, default="", help="path of checkpoint file.")
parser.add_argument('--data_path', type=str, default="", help="path of MindRecord file.")

args = parser.parse_args()
task = args.task_type
metrics = args.metrics
ckpt_path = args.ckpt_path
if task not in ["generate", "lambada", "wikitext"]:
raise ValueError("{} is not supported now".format(task))

if metrics not in ["acc", "ppl"]:
raise ValueError("{} is not supported now".format(metrics))


config = GPTConfig(batch_size=16,
seq_length=1024,
vocab_size=50257,
embedding_size=1024,
num_layers=24,
num_heads=16,
expand_ratio=4,
post_layernorm_residual=False,
dropout_rate=0.0,
compute_dtype=mstype.float16,
use_past=False)

ckpt_dict = load_checkpoint(ckpt_path)

gpt = GPT(config)
if task == "generate":
gpt_eval = EvalNet(gpt, generate=True)
elif metrics == "acc":
gpt_eval = EvalNet(gpt, generate=False)
else:
loss = CrossEntropyLoss(config)
gpt_eval = GPTWithLoss(gpt, loss)

gpt_eval.set_train(False)
load_param_into_net(gpt_eval, ckpt_dict)

if task == "generate":
start_sentence = [6170, 318, 257]
input_ids = np.array(start_sentence).reshape(1, -1)
outputs = generate(gpt_eval, input_ids, config.seq_length)
output_list = outputs.tolist()
print("output id is ", output_list)
else:
data_path = args.data_path
eval_dataset = create_dataset(config.batch_size, data_path=data_path, drop=False)
if metrics == "acc":
acc = get_acc(gpt_eval, eval_dataset)
print("Accuracy is ", acc)
elif metrics == "ppl":
ppl = get_ppl(gpt_eval, eval_dataset)
print("Perplexity is ", ppl)

if __name__ == "__main__":
run_eval()

+ 38
- 0
model_zoo/official/nlp/gpt/scripts/run_distribute_train.sh

@@ -0,0 +1,38 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

echo "=============================================================================================================="
echo "Please run the scipt as: "
echo "bash run_distributed_pretrain_ascend.sh DATA_DIR RANK_TABLE_FILE DEVICE_NUM"
echo "for example: bash run_distributed_pretrain_ascend.sh /path/dataset /path/hccl.json 8"
echo "It is better to use absolute path."
echo "=============================================================================================================="

ROOT_PATH=$(pwd)
DATA_DIR=$1
export RANK_TABLE_FILE=$2
RANK_SIZE=$3


for((i=0;i<${RANK_SIZE};i++));
do
rm ${ROOT_PATH}/device$i/ -rf
mkdir ${ROOT_PATH}/device$i
cd ${ROOT_PATH}/device$i || exit
export RANK_ID=$i
export DEVICE_ID=$i
python ${ROOT_PATH}/train.py --distribute=true --device_num=$RANK_SIZE --data_path=$DATA_DIR >log$i.log 2>&1 &
done

+ 33
- 0
model_zoo/official/nlp/gpt/scripts/run_evaluation.sh

@@ -0,0 +1,33 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

echo "=============================================================================================================="
echo "Please run the scipt as: "
echo "bash scripts/run_evaluation.sh TASK_TYPE CKPT_PATH DATA_PATH METRICS"
echo "for example: bash scripts/run_evaluation.sh lambada /your/ckpt /your/data acc"
echo "=============================================================================================================="


TASK_TYPE=$1
CKPT_PATH=$2
DATA_PATH=$3
METRICS=$4
python eval.py \
--task_type=$TASK_TYPE \
--ckpt_path=$CKPT_PATH \
--data_path=$DATA_PATH \
--metrics=$METRICS


+ 33
- 0
model_zoo/official/nlp/gpt/scripts/run_standalone_train.sh

@@ -0,0 +1,33 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

echo "=============================================================================================================="
echo "Please run the scipt as: "
echo "bash run_standalone_pretrain_ascend.sh DEVICE_ID EPOCH_SIZE DATA_DIR"
echo "for example: bash run_standalone_pretrain_ascend.sh 0 40 /path/zh-wiki/"
echo "=============================================================================================================="

DEVICE_ID=$1
EPOCH_SIZE=$2
DATA_DIR=$3


python train.py \
--distribute="false" \
--epoch_size=$EPOCH_SIZE \
--device_id=$DEVICE_ID \
--data_path=$DATA_DIR \
--optimizer="adam" > training_log.txt 2>&1 &

+ 48
- 0
model_zoo/official/nlp/gpt/src/dataset.py

@@ -0,0 +1,48 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
Create dataset for training and evaluating
"""

import os
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C
import mindspore.common.dtype as mstype


def create_dataset(batch_size, data_path, device_num=1, rank=0, drop=True):
"""
Create dataset

Inputs:
batch_size: batch size
data_path: path of your MindRecord files
device_num: total device number
rank: current rank id
drop: whether drop remainder

Returns:
dataset: the dataset for training or evaluating
"""
home_path = os.path.join(os.getcwd(), data_path)
data = [os.path.join(home_path, name) for name in os.listdir(data_path) if name.endswith("mindrecord")]
print(data)
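# num_shards/shard_id split the MindRecord data across devices for data-parallel training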
dataset = ds.MindDataset(data, columns_list=["input_ids"], shuffle=True, num_shards=device_num, shard_id=rank)
type_cast_op = C.TypeCast(mstype.int32)
dataset = dataset.map(input_columns="input_ids", operations=type_cast_op)
dataset = dataset.batch(batch_size, drop_remainder=drop)
dataset = dataset.repeat(1)
return dataset

+ 545
- 0
model_zoo/official/nlp/gpt/src/gpt.py

@@ -0,0 +1,545 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""GPT model"""
import math
import numpy as np
import mindspore.nn as nn
from mindspore.common.tensor import Tensor
from mindspore.common.parameter import Parameter
import mindspore.common.dtype as mstype
from mindspore.common.initializer import TruncatedNormal, initializer
from mindspore.ops import operations as P
from mindspore.ops import functional as F
class Mapping(nn.Cell):
"""
A mapping function with a 3d input
Args:
input_size: the size of the last dimension of the input tensor
output_size: the desired size of the last dimension of the output tensor
dtype: the compute datatype
scale: the scale factor for initialization
Inputs:
x: the 3d input
Returns:
output: Tensor, a 3d tensor after projection
"""
def __init__(self, input_size, output_size, dtype, scale=1.0):
super(Mapping, self).__init__()
self.output_size = output_size
self.input_size = input_size
weight = np.random.normal(loc=0.0, scale=0.02*scale, size=(input_size, output_size))
bias = np.zeros(shape=(output_size,))
self.weight = Parameter(Tensor(weight, mstype.float32), name="mapping_weight")
self.bias = Parameter(Tensor(bias, mstype.float32), name="mapping_bias")
self.dtype = dtype
self.cast = P.Cast()
def construct(self, x):
out_shape = P.Shape()(x)[:-1] + (self.output_size,)
x = P.Reshape()(x, (-1, self.input_size))
x = nn.MatMul()(x, self.cast(self.weight, self.dtype)) + self.cast(self.bias, self.dtype)
output = P.Reshape()(x, out_shape)
return output
class Output(nn.Cell):
"""
The output mapping module for each layer
Args:
config(GPTConfig): the config of network
scale: scale factor for initialization
Inputs:
x: output of the self-attention module
Returns:
output: Tensor, the output of this layer after mapping
"""
def __init__(self, config, scale=1.0):
super(Output, self).__init__()
input_size = config.embedding_size
output_size = config.embedding_size*config.expand_ratio
self.mapping = Mapping(input_size, output_size, config.compute_dtype)
self.projection = Mapping(output_size, input_size, config.compute_dtype, scale)
self.activation = nn.GELU()
self.dropout = nn.Dropout(1-config.dropout_rate)
def construct(self, x):
hidden = self.activation(self.mapping(x))
output = self.projection(hidden)
output = self.dropout(output)
return output
class AttentionMask(nn.Cell):
"""
Get the attention matrix for self-attention module
Args:
config(GPTConfig): the config of network
Inputs:
input_mask: the mask indicating whether each position is a valid input
Returns:
attention_mask: the attention mask matrix with shape (batch_size, 1, seq_length, seq_length)
"""
def __init__(self, config):
super(AttentionMask, self).__init__()
self.reshape = P.Reshape()
self.mul = P.BatchMatMul()
ones = np.ones(shape=(config.seq_length, config.seq_length))
self.lower_triangle_mask = Tensor(np.tril(ones), mstype.float32)
self.multiply = P.Mul()
def construct(self, input_mask):
input_shape = P.Shape()(input_mask)
shape_right = (input_shape[0], 1, input_shape[1])
shape_left = input_shape + (1,)
mask_left = self.reshape(input_mask, shape_left)
mask_right = self.reshape(input_mask, shape_right)
attention_mask = self.mul(mask_left, mask_right)
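# combine the padding mask with a lower-triangular (causal) mask so each token only attends to itself and earlier positions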
lower_triangle = P.ExpandDims()(self.lower_triangle_mask, 0)
attention_mask = self.multiply(attention_mask, lower_triangle)  # bs, seq_length, seq_length
return attention_mask
class EmbeddingLookup(nn.Cell):
"""
The embedding lookup table for vocabulary
Args:
config(GPTConfig): the config of network
Inputs:
input_ids: the tokenized inputs with datatype int32
Returns:
output: Tensor, the embedding vector for the input with shape (batch_size, seq_length, embedding_size)
self.embedding_table: Tensor, the embedding table for the vocabulary
"""
def __init__(self, config):
super(EmbeddingLookup, self).__init__()
self.vocab_size = config.vocab_size
self.embedding_size = config.embedding_size
self.embedding_table = Parameter(initializer(TruncatedNormal(0.02), [self.vocab_size, self.embedding_size]),
name="embedding_table")
self.gather = P.GatherV2()
self.shape = (-1, config.seq_length, config.embedding_size)
def construct(self, input_ids):
output = self.gather(self.embedding_table, input_ids, 0)
return output, self.embedding_table
class Attention(nn.Cell):
"""
Self-Attention module for each layer
Args:
config(GPTConfig): the config of network
scale: scale factor for initialization
layer_idx: current layer index
"""
def __init__(self, config, scale=1.0, layer_idx=None):
super(Attention, self).__init__()
self.get_attention_mask = AttentionMask(config)
self.expand_mapping = Mapping(config.embedding_size, 3*config.embedding_size, config.compute_dtype)
self.projection = Mapping(config.embedding_size, config.embedding_size, config.compute_dtype, scale)
self.split = P.Split(axis=-1, output_num=3)
self.transpose = P.Transpose()
self.reshape = P.Reshape()
self.n_head = config.num_heads
self.size_per_head = config.embedding_size // self.n_head
self.concat_k = P.Concat(axis=3)
self.concat_v = P.Concat(axis=2)
self.multiply_data = Tensor([-10000.0,], dtype=mstype.float32)
self.batch_matmul = P.BatchMatMul()
self.scale = scale
if self.scale:
self.scale_factor = Tensor(math.sqrt(self.size_per_head))
if layer_idx is not None:
self.coeff = math.sqrt(layer_idx * math.sqrt(self.size_per_head))
self.coeff = Tensor(self.coeff)
self.use_past = config.use_past
self.dropout = nn.Dropout(1-config.dropout_rate)
self.prob_dropout = nn.Dropout(1-config.dropout_rate)
self.softmax = nn.Softmax()
self.dense1 = nn.Dense(config.embedding_size, config.embedding_size).to_float(config.compute_dtype)
self.dense2 = nn.Dense(config.embedding_size, config.embedding_size).to_float(config.compute_dtype)
self.dense3 = nn.Dense(config.embedding_size, config.embedding_size).to_float(config.compute_dtype)
def construct(self, x, attention_mask, layer_past=None):
"""
self-attention
Inputs:
x: output of previous layer
attention_mask: the attention mask matrix with shape (batch_size, 1, seq_length, seq_length)
layer_past: the previous feature map
Returns:
output: Tensor, the output logit of this layer
layer_present: Tensor, the feature map of current layer
"""
original_shape = F.shape(x)
x = F.reshape(x, (-1, original_shape[-1]))
query = self.dense1(x)
key = self.dense2(x)
value = self.dense3(x)
query = self.transpose(F.reshape(query, (-1, original_shape[1], self.n_head, self.size_per_head)), (0, 2, 1, 3))
key = self.transpose(F.reshape(key, (-1, original_shape[1], self.n_head, self.size_per_head)), (0, 2, 3, 1))
value = self.transpose(F.reshape(value, (-1, original_shape[1], self.n_head, self.size_per_head)), (0, 2, 1, 3))
if self.use_past:
past_value = layer_past[1]
past_key = self.transpose(layer_past[0], (0, 1, 3, 2))
key = self.concat_k((past_key, key))
value = self.concat_v((past_value, value))
layer_present = P.Pack()([self.transpose(key, (0, 1, 3, 2)), value])
attention = self._attn(query, key, value, attention_mask)
attention_merge = self.merge_heads(attention)
output = self.projection(attention_merge)
output = self.dropout(output)
return output, layer_present
def split_heads(self, x, transpose):
"""
split 3d tensor to 4d and switch certain axes
Inputs:
x: input tensor
transpose: tuple, the transpose sequence
Returns:
x_transpose: the 4d output
"""
x_size = P.Shape()(x)
new_x_shape = x_size[:-1] + (self.n_head, self.size_per_head)
x = self.reshape(x, new_x_shape)
x_transpose = self.transpose(x, transpose)
return x_transpose
def merge_heads(self, x):
"""
convert a 4d input to a 3d output
Inputs:
x: input tensor
Returns:
x_merge: the 3d output
"""
x = self.transpose(x, (0, 2, 1, 3)) #bs, seq_length, head, size_per_head
x_shape = P.Shape()(x)
new_shape = x_shape[:-2] + (x_shape[-2]*x_shape[-1],)
x_merge = self.reshape(x, new_shape)
return x_merge
def _attn(self, query, key, value, attention_mask):
"""
Get the weighted score along the seq_length
Inputs:
query: the query matrix
key: the key matrix
value: the value matrix
attention_mask: the attention mask matrix with shape (batch_size, 1, seq_length, seq_length)
Returns:
weighted_values: Tensor, the weighted sum scores
"""
if not self.scale:
query = query / F.cast(self.coeff, F.dtype(query))
key = key / F.cast(self.coeff, F.dtype(key))
score = self.batch_matmul(query, key)
if self.scale:
score = score / P.Cast()(self.scale_factor, P.DType()(score))
ori_dtype = P.DType()(score)
score = P.Cast()(score, mstype.float32)
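# masked positions get (1 - mask) * -10000 added to their scores so softmax assigns them ~0 weight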
multiply_out = P.Sub()(P.Cast()(F.tuple_to_array((1.0,)), P.DType()(score)),
P.Cast()(attention_mask, P.DType()(score)))
adder = P.Mul()(multiply_out, self.multiply_data)
attention_scores = adder + score
attention_scores = P.Cast()(attention_scores, ori_dtype)
shape = F.shape(attention_scores)
attention_probs = nn.Softmax()(F.reshape(attention_scores, (-1, shape[-1])))
attention_probs = F.reshape(attention_probs, shape)
attention_probs = self.prob_dropout(attention_probs)
weighted_values = self.batch_matmul(attention_probs, value)
return weighted_values
class Block(nn.Cell):
"""
The basic block of GPT network
Args:
config(GPTConfig): the config of network
layer_idx: current layer index
Inputs:
x: the output of previous layer(input_ids for the first layer)
attention_mask: the attention mask matrix with shape (batch_size, 1, seq_length, seq_length)
layer_past: the previous feature map
Returns:
output: Tensor, the output logit of this layer
layer_present: Tensor, the feature map of current layer
"""
def __init__(self, config, layer_idx):
super(Block, self).__init__()
scale = 1 / math.sqrt(2.0*layer_idx)
self.layernorm1 = nn.LayerNorm((config.embedding_size,)).to_float(config.compute_dtype)
self.attention = Attention(config, scale, layer_idx)
self.layernorm2 = nn.LayerNorm((config.embedding_size,)).to_float(config.compute_dtype)
self.output = Output(config, scale)
self.post_layernorm_residual = config.post_layernorm_residual
def construct(self, x, attention_mask, layer_past=None):
"""basic block of each layer"""
input_x = self.layernorm1(x)
attention, layer_present = self.attention(input_x, attention_mask, layer_past)
if self.post_layernorm_residual:
x = input_x + attention
else:
x = x + attention
output_x = self.layernorm2(x)
mlp_logit = self.output(output_x)
if self.post_layernorm_residual:
output = output_x + mlp_logit
else:
output = x + mlp_logit
return output, layer_present
class GPT_Model(nn.Cell):
"""
The backbone of GPT network
Args:
config(GPTConfig): the config of network
Inputs:
input_ids: the tokenized inputs with datatype int32
input_mask: the mask indicating whether each position is a valid input
layer_past: the previous feature map
Returns:
output_state: Tensor, the output logit of backbone
present_layer: Tensor, the current feature map
embedding_table: Tensor, the embedding table for the vocabulary
"""
def __init__(self, config):
super(GPT_Model, self).__init__()
self.get_attention_mask = AttentionMask(config)
self.word_embedding = EmbeddingLookup(config)
self.position_embedding = nn.Embedding(config.seq_length, config.embedding_size,
embedding_table=TruncatedNormal(0.02))
self.blocks = nn.CellList()
for i in range(config.num_layers):
self.blocks.append(Block(config, i+1))
self.layernorm = nn.LayerNorm((config.embedding_size,)).to_float(config.compute_dtype)
self.use_past = config.use_past
self.past = tuple([None]*config.num_layers)
self.num_layers = config.num_layers
def construct(self, input_ids, input_mask, layer_past=None):
"""GPT model"""
if not self.use_past:
layer_past = self.past
input_embedding, embedding_table = self.word_embedding(input_ids)
batch_size, seq_length = F.shape(input_ids)
input_position = F.tuple_to_array(F.make_range(seq_length))
input_position = P.Tile()(input_position, (batch_size, 1))
position_embedding = self.position_embedding(input_position)
hidden_states = input_embedding + position_embedding
hidden_states = P.Cast()(hidden_states, mstype.float16)
attention_mask = self.get_attention_mask(input_mask)
attention_mask = P.ExpandDims()(attention_mask, 1)
present_layer = ()
for i in range(self.num_layers):
hidden_states, present = self.blocks[i](hidden_states, attention_mask, layer_past)
present_layer = present_layer + (present,)
output_state = self.layernorm(hidden_states)
return output_state, present_layer, embedding_table
class GPT_Head(nn.Cell):
"""
Head for GPT to get the logits of each token in the vocab
Args:
config(GPTConfig): the config of network
Inputs:
state: the output of the backbone
embedding_table: the embedding table of the vocabulary
Returns:
logits: Tensor, the logits of the corresponding inputs
"""
def __init__(self, config):
super(GPT_Head, self).__init__()
self.matmul = P.MatMul(transpose_b=True)
self.embedding_size = config.embedding_size
self.log_softmax = P.LogSoftmax(axis=-1)
self.dtype = config.compute_dtype
self.cast = P.Cast()
def construct(self, state, embedding_table):
state = P.Reshape()(state, (-1, self.embedding_size))
logits = self.matmul(state, self.cast(embedding_table, self.dtype))
return logits
class GPT(nn.Cell):
"""
The GPT network, consisting of two parts: the backbone and the head
Args:
config(GPTConfig): the config of network
Inputs:
input_ids: the tokenized inputs
input_mask: the mask indicating whether each position is a valid input
past: the previous feature map
Returns:
logits: Tensor, the logits of the corresponding inputs with shape (batch_size, seq_length, vocab_size)
"""
def __init__(self, config):
super(GPT, self).__init__()
self.backbone = GPT_Model(config)
self.head = GPT_Head(config)
def construct(self, input_ids, input_mask, past=None):
output_states, _, embedding_table = self.backbone(input_ids, input_mask, past)
logits = self.head(output_states, embedding_table)
return logits
class CrossEntropyLoss(nn.Cell):
"""
Calculate the cross entropy loss
Args:
config(GPTConfig): the config of the network
Inputs:
logits: the output logits of the backbone
label: the ground truth label of the sample
input_mask: the mask indicating whether each position is a valid input
Returns:
loss: Tensor, the corresponding cross entropy loss
"""
def __init__(self, config):
super(CrossEntropyLoss, self).__init__()
self.log_softmax = nn.LogSoftmax(axis=-1)
self.mean = P.ReduceMean()
self.sum = P.ReduceSum()
self.onehot = P.OneHot()
self.on_value = Tensor(1.0, mstype.float32)
self.off_value = Tensor(0.0, mstype.float32)
self.vocab_size = config.vocab_size
def construct(self, logits, label, input_mask):
logits = self.log_softmax(P.Cast()(logits, mstype.float32))
label = P.Reshape()(label, (-1,))
one_hot_label = self.onehot(label, self.vocab_size, self.on_value, self.off_value)
loss_sum = P.Neg()(self.sum(logits*one_hot_label, (-1,)))
input_mask = P.Reshape()(input_mask, (-1,))
numerator = self.sum(loss_sum*input_mask)
denominator = self.sum(input_mask) + P.Cast()(F.tuple_to_array((1e-5,)), mstype.float32)
loss = numerator / denominator
return loss
class GPTWithLoss(nn.Cell):
"""
GPT training loss
Args:
network: backbone network of GPT2/3
loss: loss function, e.g., crossentropy
eos_token: the end_of_sentence token
Inputs:
input_ids: the tokenized inputs
past: the previous feature map
Returns:
output: Tensor, the loss of the network
"""
def __init__(self, network, loss, eos_token=50256):
super(GPTWithLoss, self).__init__(auto_prefix=False)
self.network = network
self.loss = loss
self.eos_token = eos_token
def construct(self, input_ids, past=None):
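# next-token prediction: feed all tokens except the last, and use the sequence shifted left by one as labels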
tokens = input_ids[:, :-1]
input_mask = F.cast(F.not_equal(tokens, self.eos_token), mstype.float32)
logits = self.network(tokens, input_mask, past)
labels = input_ids[:, 1:]
output = self.loss(logits, labels, input_mask)
return output
class EvalNet(nn.Cell):
"""
GPT evaluation net
Args:
backbone: backbone network of GPT2/3
generate: enable generate mode
Inputs:
input_ids: the tokenized inputs
Returns:
outputs: Tensor, corresponding output for different tasks
"""
def __init__(self, backbone, generate=False):
super(EvalNet, self).__init__(auto_prefix=False)
self.backbone = backbone
self.argmax = P.Argmax()
self.generate = generate
def construct(self, input_ids):
"""evaluation net"""
input_mask = F.cast(F.not_equal(input_ids, 0), mstype.float32)
logits = self.backbone(input_ids, input_mask)
outputs = None
if self.generate:
outputs = nn.LogSoftmax()(logits)
outputs = F.tensor_pow(np.e, outputs)
else:
outputs = self.argmax(logits)
return outputs

+ 157
- 0
model_zoo/official/nlp/gpt/src/gpt_wrapcell.py

@@ -0,0 +1,157 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""GPT training wrapper"""
import mindspore.nn as nn
from mindspore.ops import operations as P
from mindspore.ops import composite as C
from mindspore.ops import functional as F
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.nn.wrap.grad_reducer import DistributedGradReducer
from mindspore.communication.management import get_group_size
from mindspore.common.tensor import Tensor
import mindspore.common.dtype as mstype
from mindspore.common.parameter import Parameter
from src.utils import ClipByGlobalNorm
GRADIENT_CLIP_TYPE = 1
GRADIENT_CLIP_VALUE = 1.0
clip_grad = C.MultitypeFuncGraph("clip_grad")
@clip_grad.register("Number", "Number", "Tensor")
def _clip_grad(clip_type, clip_value, grad):
"""
Clip gradients.
Inputs:
clip_type (int): The way to clip, 0 for 'value', 1 for 'norm'.
clip_value (float): Specifies how much to clip.
grad (tuple[Tensor]): Gradients.
Outputs:
tuple[Tensor], clipped gradients.
"""
if clip_type not in [0, 1]:
return grad
dt = F.dtype(grad)
if clip_type == 0:
new_grad = C.clip_by_value(grad, F.cast(F.tuple_to_array((-clip_value,)), dt),
F.cast(F.tuple_to_array((clip_value,)), dt))
else:
new_grad = nn.ClipByNorm()(grad, F.cast(F.tuple_to_array((clip_value,)), dt))
return new_grad
grad_scale = C.MultitypeFuncGraph("grad_scale")
reciprocal = P.Reciprocal()
@grad_scale.register("Tensor", "Tensor")
def tensor_grad_scale(scale, grad):
return grad * reciprocal(scale)
class GPTTrainOneStepWithLossScaleCell(nn.Cell):
"""
Encapsulation class of GPT network training.
Append an optimizer to the training network. After that, the construct
function can be called to create the backward graph.
Args:
network (Cell): The training network. Note that loss function should have been added.
optimizer (Optimizer): Optimizer for updating the weights.
scale_update_cell (Cell): Cell to do the loss scale. Default: None.
"""
def __init__(self, network, optimizer, scale_update_cell=None, enable_global_norm=False):
super(GPTTrainOneStepWithLossScaleCell, self).__init__(auto_prefix=False)
self.network = network
self.weights = optimizer.parameters
self.optimizer = optimizer
self.enable_global_norm = enable_global_norm
self.grad = C.GradOperation(get_by_list=True,
sens_param=True)
self.reducer_flag = False
self.allreduce = P.AllReduce()
self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
if self.parallel_mode in [ParallelMode.DATA_PARALLEL, ParallelMode.HYBRID_PARALLEL]:
self.reducer_flag = True
self.grad_reducer = F.identity
self.degree = 1
if self.reducer_flag:
self.degree = get_group_size()
self.grad_reducer = DistributedGradReducer(optimizer.parameters, False, self.degree)
self.is_distributed = (self.parallel_mode != ParallelMode.STAND_ALONE)
self.cast = P.Cast()
self.alloc_status = P.NPUAllocFloatStatus()
self.get_status = P.NPUGetFloatStatus()
self.clear_before_grad = P.NPUClearFloatStatus()
self.reduce_sum = P.ReduceSum(keep_dims=False)
self.depend_parameter_use = P.ControlDepend(depend_mode=1)
self.base = Tensor(1, mstype.float32)
self.less_equal = P.LessEqual()
self.hyper_map = C.HyperMap()
self.loss_scale = None
self.loss_scaling_manager = scale_update_cell
if scale_update_cell:
self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32),
name="loss_scale")
@C.add_flags(has_effect=True)
def construct(self,
input_ids,
past=None,
sens=None):
"""Defines the computation performed."""
weights = self.weights
loss = self.network(input_ids,
past)
if sens is None:
scaling_sens = self.loss_scale
else:
scaling_sens = sens
# alloc status and clear should be right before gradoperation
init = self.alloc_status()
self.clear_before_grad(init)
grads = self.grad(self.network, weights)(input_ids,
past,
self.cast(scaling_sens,
mstype.float32))
# apply grad reducer on grads
grads = self.grad_reducer(grads)
grads = self.hyper_map(F.partial(grad_scale, scaling_sens * self.degree), grads)
if self.enable_global_norm:
grads = ClipByGlobalNorm()(grads)
else:
grads = self.hyper_map(F.partial(clip_grad, GRADIENT_CLIP_TYPE, GRADIENT_CLIP_VALUE), grads)
self.get_status(init)
flag_sum = self.reduce_sum(init, (0,))
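# a non-zero float status flag means an overflow happened during the backward pass; the parameter update is skipped in that case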
if self.is_distributed:
# sum overflow flag over devices
flag_reduce = self.allreduce(flag_sum)
cond = self.less_equal(self.base, flag_reduce)
else:
cond = self.less_equal(self.base, flag_sum)
overflow = cond
if sens is None:
overflow = self.loss_scaling_manager(self.loss_scale, cond)
if overflow:
succ = False
else:
succ = self.optimizer(grads)
ret = (loss, cond, scaling_sens)
return F.depend(ret, succ)

+ 60
- 0
model_zoo/official/nlp/gpt/src/inference.py

@@ -0,0 +1,60 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================

"""
TopK for text generation
"""

import numpy as np
import mindspore.common.dtype as mstype
from mindspore.common.tensor import Tensor

def generate(model, origin_inputs, seq_length, end_token=50256):
"""
TopK for text generation

Inputs:
model: the model for inferencing
origin_inputs: the original inputs based on which the model will continue writing
seq_length: seq_length for the model
end_token: end of sentence token id

Returns:
outputs: the ids for the generated text
"""
TOPK = 5
bs, valid_length = origin_inputs.shape
pad_length = seq_length - origin_inputs.shape[-1]
input_ids = np.pad(origin_inputs, ((0, 0), (0, pad_length)), 'constant', constant_values=(0, 0))
print("input_ids is ", input_ids)
while valid_length < seq_length:
inputs = Tensor(input_ids, mstype.int32)
logits = model(inputs).asnumpy()
logits = logits.reshape(bs, seq_length, -1)
probs = logits[0, valid_length-1, :]
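# keep only the TOPK highest-probability tokens, renormalize, and sample the next token from them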
p_args = probs.argsort()[::-1][:TOPK]

p = probs[p_args]
p = p / sum(p)
target_index = np.random.choice(len(p), p=p)
if p_args[target_index] == end_token or valid_length == seq_length-1:
outputs = input_ids
break
input_ids[0][valid_length] = p_args[target_index]
valid_length += 1
length = np.sum(outputs != 0)
outputs = outputs[0][:length]
return outputs

+ 138
- 0
model_zoo/official/nlp/gpt/src/utils.py

@@ -0,0 +1,138 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
network config setting, gradient clip function and dynamic learning rate function
"""
import mindspore.nn as nn
from mindspore.ops import operations as P
from mindspore.ops import composite as C
from mindspore.ops import functional as F
import mindspore.common.dtype as mstype
from mindspore.common.tensor import Tensor
from mindspore.nn.learning_rate_schedule import LearningRateSchedule, PolynomialDecayLR, WarmUpLR, CosineDecayLR
import numpy as np
class GPTConfig:
"""
GPT config class which defines the model size
"""
def __init__(self,
batch_size=32,
seq_length=1024,
vocab_size=50257,
embedding_size=768,
num_layers=12,
num_heads=12,
expand_ratio=4,
post_layernorm_residual=False,
dropout_rate=0.1,
compute_dtype=mstype.float16,
use_past=False):
self.batch_size = batch_size
self.seq_length = seq_length
self.vocab_size = vocab_size
self.embedding_size = embedding_size
self.num_layers = num_layers
self.num_heads = num_heads
self.expand_ratio = expand_ratio
self.post_layernorm_residual = post_layernorm_residual
self.dropout_rate = dropout_rate
self.compute_dtype = compute_dtype
self.use_past = use_past
get_square_sum = C.MultitypeFuncGraph("get_square_sum")
@get_square_sum.register("Tensor")
def _get_square_sum(grad):
norm = P.ReduceSum(False)(F.square(grad), ())
norm = F.expand_dims(F.cast(norm, mstype.float32), 0)
return norm
apply_global_norm = C.MultitypeFuncGraph("apply_global_norm")
@apply_global_norm.register("Tensor", "Tensor", "Tensor")
def _apply_global_norm(clip_norm, global_norm, grad):
grad = grad * clip_norm / global_norm
return grad
class GlobalNorm(nn.Cell):
"""
Calculate the global norm value of given tensors
"""
def __init__(self):
super(GlobalNorm, self).__init__()
self.norm = nn.Norm()
self.hyper_map = C.HyperMap()
def construct(self, grads):
square_sum = self.hyper_map(get_square_sum, grads)
global_norms = F.sqrt(F.addn(square_sum) / F.scalar_to_array(len(square_sum)))
return global_norms
class ClipByGlobalNorm(nn.Cell):
"""
Clip grads by global norm
"""
def __init__(self, clip_norm=1.0):
super(ClipByGlobalNorm, self).__init__()
self.global_norm = GlobalNorm()
self.clip_norm = Tensor([clip_norm], mstype.float32)
self.hyper_map = C.HyperMap()
def construct(self, grads):
global_norm = self.global_norm(grads)
cond = P.GreaterEqual()(global_norm, self.clip_norm)
global_norm = F.select(cond, global_norm, self.clip_norm)
grads = self.hyper_map(F.partial(apply_global_norm, self.clip_norm, global_norm), grads)
return grads
class LearningRate(LearningRateSchedule):
"""
Warmup-decay learning rate for GPT network.
"""
def __init__(self, learning_rate, end_learning_rate, warmup_steps, decay_steps, power=1.0, use_cosine=True):
super(LearningRate, self).__init__()
self.warmup_flag = False
if warmup_steps > 0:
self.warmup_flag = True
self.warmup_lr = WarmUpLR(learning_rate, warmup_steps)
self.decay_lr = PolynomialDecayLR(learning_rate, end_learning_rate, decay_steps, power)
self.cosine_decay_lr = CosineDecayLR(end_learning_rate, learning_rate, decay_steps)
self.warmup_steps = Tensor(np.array([warmup_steps]).astype(np.float32))
self.greater = P.Greater()
self.one = Tensor(np.array([1.0]).astype(np.float32))
self.cast = P.Cast()
self.use_cosine = use_cosine
def construct(self, global_step):
"""dynamic learning rate"""
if not self.use_cosine:
decay_lr = self.decay_lr(global_step)
else:
decay_lr = self.cosine_decay_lr(global_step)
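# before warmup_steps the warm-up LR is used; afterwards the decay schedule takes over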
if self.warmup_flag:
is_warmup = self.cast(self.greater(self.warmup_steps, global_step), mstype.float32)
warmup_lr = self.warmup_lr(global_step)
lr = (self.one - is_warmup) * decay_lr + is_warmup * warmup_lr
else:
lr = decay_lr
return lr

+ 133
- 0
model_zoo/official/nlp/gpt/train.py

@@ -0,0 +1,133 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
GPT train script
"""
import os
import argparse
from mindspore import context
from mindspore.train.model import Model
import mindspore.communication.management as D
from mindspore.context import ParallelMode
import mindspore.nn as nn
from mindspore.train.callback import TimeMonitor, LossMonitor, ModelCheckpoint, CheckpointConfig
from mindspore.nn.wrap.loss_scale import DynamicLossScaleUpdateCell
import mindspore.common.dtype as mstype
from mindspore.common import set_seed
from src.dataset import create_dataset
from src.gpt import GPT, GPTWithLoss, CrossEntropyLoss
from src.gpt_wrapcell import GPTTrainOneStepWithLossScaleCell
from src.utils import GPTConfig, LearningRate
def run_train():
"""train function for GPT"""
parser = argparse.ArgumentParser(description="GPT training")
parser.add_argument('--device_id', type=int, default=0, help="Device id, default is 0.")
parser.add_argument("--device_num", type=int, default=1, help="Use device nums, default is 1.")
parser.add_argument("--distribute", type=str, default="false", choices=["true", "false"],
help="Run distribute, default is false.")
parser.add_argument("--optimizer", type=str, default="adam", choices=["adam", "lamb"],
help="select which optimizer to be used, default adam")
parser.add_argument("--epoch_size", type=int, default=10, help="Epoch size, default is 10.")
parser.add_argument("--warmup_step", type=int, default=10000, help="Warmup step, default is 10000.")
parser.add_argument("--data_path", type=str, default="", help="Data path of your MindRecord files.")
parser.add_argument("--start_lr", type=float, default="5e-5", help="Start learning rate, default is 5e-5.")
parser.add_argument("--end_lr", type=float, default="1e-10", help="End learning rate, default is 1e-10.")
parser.add_argument("--sink_size", type=int, default=100, help="Sink size for every iteration, default is 100")
args_opt = parser.parse_args()
device_id = int(os.getenv("DEVICE_ID", args_opt.device_id))  # fall back to --device_id when DEVICE_ID is not set
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=device_id)
if args_opt.distribute == "true":
D.init()
device_num = args_opt.device_num
rank = device_id % device_num
print("device_id is {}, rank_id is {}".format(device_id, rank))
context.reset_auto_parallel_context()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
device_num=device_num)
else:
rank = 0
device_num = 1
config = GPTConfig(batch_size=4,
seq_length=1024,
vocab_size=50257,
embedding_size=1024,
num_layers=24,
num_heads=16,
expand_ratio=4,
post_layernorm_residual=False,
dropout_rate=0.1,
compute_dtype=mstype.float16,
use_past=False)
gpt = GPT(config)
loss = CrossEntropyLoss(config)
gpt_with_loss = GPTWithLoss(gpt, loss)
ds = create_dataset(config.batch_size, data_path=args_opt.data_path, device_num=device_num, rank=rank)
epoch_num = args_opt.epoch_size
step_per_epoch = ds.get_dataset_size()
lr = LearningRate(learning_rate=args_opt.start_lr,
end_learning_rate=args_opt.end_lr,
warmup_steps=args_opt.warmup_step,
decay_steps=epoch_num*step_per_epoch)
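# exclude LayerNorm weights and all bias parameters from weight decay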
decay_filter = lambda x: 'layernorm' not in x.name.lower() and "bias" not in x.name.lower()
params = gpt.trainable_params()
decay_params = list(filter(decay_filter, params))
other_params = list(filter(lambda x: not decay_filter(x), params))
group_params = [{'params': decay_params, 'weight_decay': 1e-2},
{'params': other_params, 'weight_decay': 0.0},
{'order_params': params}]
if args_opt.optimizer == "lamb":
optimizer = nn.Lamb(group_params, learning_rate=lr)
else:
optimizer = nn.AdamWeightDecay(group_params, learning_rate=lr)
callback_size = args_opt.sink_size
actual_epoch_num = int(epoch_num * step_per_epoch/callback_size)
callback = [TimeMonitor(callback_size), LossMonitor(callback_size)]
config_ck = CheckpointConfig(save_checkpoint_steps=step_per_epoch, keep_checkpoint_max=1)
ckpoint_cb = ModelCheckpoint(prefix="GPT2", config=config_ck)
callback.append(ckpoint_cb)
update_cell = DynamicLossScaleUpdateCell(loss_scale_value=1024,
scale_factor=2,
scale_window=1000)
gpt_with_grads = GPTTrainOneStepWithLossScaleCell(gpt_with_loss, optimizer=optimizer,
scale_update_cell=update_cell)
model = Model(gpt_with_grads)
model.train(actual_epoch_num, ds, callbacks=callback, sink_size=callback_size)
if __name__ == "__main__":
set_seed(12315)
run_train()
