@@ -0,0 +1,272 @@
# Contents

- [DenseNet121 Description](#densenet121-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Training](#training)
        - [Distributed Training](#distributed-training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training accuracy results](#training-accuracy-results)
        - [Training performance results](#training-performance-results)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [DenseNet121 Description](#contents)

DenseNet121 is a convolutional neural network for the task of image classification. The paper describing the model can be found [here](https://arxiv.org/abs/1608.06993). Huawei's DenseNet121 is an implementation on [MindSpore](https://www.mindspore.cn/).

The repository also contains scripts to launch training and inference routines.

# [Model Architecture](#contents)

DenseNet121 is built from 4 densely connected blocks. Within every dense block, each layer obtains additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers. The feature maps are combined by concatenation, so each layer receives the "collective knowledge" of all preceding layers.
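The sketch below illustrates this connectivity pattern with a single, simplified dense layer. `_DenseLayerSketch` and `growth_rate` are illustrative names, not the repository's exact implementation; the real network lives in `src/network/densenet.py`.

```python
import mindspore.nn as nn
from mindspore.ops import operations as P

class _DenseLayerSketch(nn.Cell):
    """One conceptual dense layer: BN -> ReLU -> Conv, then concatenate."""
    def __init__(self, in_channels, growth_rate):
        super(_DenseLayerSketch, self).__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.relu = nn.ReLU()
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3,
                              pad_mode='pad', padding=1)
        self.concat = P.Concat(axis=1)

    def construct(self, x):
        new_features = self.conv(self.relu(self.bn(x)))
        # pass both the input and the new feature maps to the next layer
        return self.concat((x, new_features))
```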
# [Dataset](#contents)

Dataset used: ImageNet

The default dataset configuration is as follows:

- Training dataset preprocessing:
    - Input size of images is 224\*224
    - Range (min, max) of the area fraction of the original image to be cropped is (0.08, 1.0)
    - Range (min, max) of the aspect ratio to be cropped is (0.75, 1.333)
    - Probability of the image being flipped is set to 0.5
    - Randomly adjust brightness, contrast, and saturation (0.4, 0.4, 0.4)
    - Normalize the input image with respect to the mean and standard deviation
- Test dataset preprocessing:
    - Input size of images is 224\*224 (resize to 256\*256, then crop images at the center)
    - Normalize the input image with respect to the mean and standard deviation
# [Features](#contents)

## Mixed Precision

The [mixed precision](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/mixed_precision.html) training method accelerates the training process of a deep neural network by using both the single-precision and half-precision data formats, while maintaining the accuracy achieved with pure single-precision training. Mixed precision training accelerates computation, reduces memory usage, and enables a larger model or batch size to be trained on specific hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically handle it with reduced precision. Users can check the reduced-precision operators by enabling the INFO log and then searching for "reduce precision".
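As a minimal sketch of how this repository applies FP16 (mirroring `network.add_flags_recursive(fp16=True)` in `eval.py` below; assumes you run from the densenet121 directory so that `src` is importable):

```python
from mindspore import context
from src.network import DenseNet121

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
network = DenseNet121(1000)
# cast the whole network so supported operators run in FP16, as eval.py does
network.add_flags_recursive(fp16=True)
```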
# [Environment Requirements](#contents)

- Hardware (Ascend)
    - Prepare a hardware environment with Ascend AI processors. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html)
    - [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)
# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

```
# run training example
python train.py --data_dir /PATH/TO/DATASET --is_distributed 0 > train.log 2>&1 &

# run distributed training example
sh scripts/run_distribute_train.sh 8 rank_table.json /PATH/TO/DATASET

# run evaluation example
python eval.py --data_dir /PATH/TO/DATASET --pretrained /PATH/TO/CHECKPOINT > eval.log 2>&1 &
OR
sh scripts/run_distribute_eval.sh 8 rank_table.json /PATH/TO/DATASET /PATH/TO/CHECKPOINT
```

For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools.
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```
├── model_zoo
    ├── README.md                         // descriptions about all the models
    ├── densenet121
        ├── README.md                     // descriptions about densenet121
        ├── scripts
        │   ├── run_distribute_train.sh   // shell script for distributed training on Ascend
        │   ├── run_distribute_eval.sh    // shell script for evaluation on Ascend
        ├── src
        │   ├── datasets                  // dataset processing functions
        │   ├── losses
        │   │   ├── crossentropy.py       // densenet loss function
        │   ├── lr_scheduler
        │   │   ├── lr_scheduler.py       // densenet learning rate schedule functions
        │   ├── network
        │   │   ├── densenet.py           // densenet architecture
        │   ├── optimizers                // densenet optimizer functions
        │   ├── utils
        │   │   ├── logging.py            // logging function
        │   │   ├── var_init.py           // densenet variable init function
        │   ├── config.py                 // network config
        ├── train.py                      // training script
        ├── eval.py                       // evaluation script
```
## [Script Parameters](#contents)

You can modify the training behaviour through the various flags in the `train.py` script. Flags in the `train.py` script are as follows:

```
--data_dir              train data dir
--num_classes           num of classes in dataset (default: 1000)
--image_size            image size of the dataset
--per_batch_size        mini-batch size (default: 256) per device
--pretrained            path of pretrained model
--lr_scheduler          type of LR schedule: exponential, cosine_annealing
--lr                    initial learning rate
--lr_epochs             epoch milestones for lr changing
--lr_gamma              decrease lr by a factor of exponential lr_scheduler
--eta_min               eta_min in cosine_annealing scheduler
--T_max                 T_max in cosine_annealing scheduler
--max_epoch             max num of epochs to train the model
--warmup_epochs         warmup epochs (when batch size is large)
--weight_decay          weight decay (default: 1e-4)
--momentum              momentum (default: 0.9)
--label_smooth          whether to use label smoothing in CE
--label_smooth_factor   smoothing strength of the original one-hot labels
--log_interval          logging interval (default: 100)
--ckpt_path             path to save checkpoints
--ckpt_interval         the interval to save checkpoints
--is_save_on_master     save checkpoint on master rank or all ranks
--is_distributed        whether to run on multiple devices (default: 1)
--rank                  local rank in distributed training (default: 0)
--group_size            world size of distributed training (default: 1)
```
## [Training Process](#contents)

### Training

- running on Ascend

```
python train.py --data_dir /PATH/TO/DATASET --is_distributed 0 > train.log 2>&1 &
```

The python command above runs in the background. The log and model checkpoints will be generated in `output/202x-xx-xx_time_xx_xx_xx/`. The loss values will be logged as follows:

```
2020-08-22 16:58:56,617:INFO:epoch[0], iter[5003], loss:4.367, mean_fps:0.00 imgs/sec
2020-08-22 16:58:56,619:INFO:local passed
2020-08-22 17:02:19,920:INFO:epoch[1], iter[10007], loss:3.193, mean_fps:6301.11 imgs/sec
2020-08-22 17:02:19,921:INFO:local passed
2020-08-22 17:05:43,112:INFO:epoch[2], iter[15011], loss:3.096, mean_fps:6304.53 imgs/sec
2020-08-22 17:05:43,113:INFO:local passed
...
```
### Distributed Training

- running on Ascend

```
sh scripts/run_distribute_train.sh 8 rank_table.json /PATH/TO/DATASET
```

The above shell script runs distributed training in the background. You can view the log and model checkpoints under `train[X]/output/202x-xx-xx_time_xx_xx_xx/`. The loss values will be logged as follows:

```
2020-08-22 16:58:54,556:INFO:epoch[0], iter[5003], loss:3.857, mean_fps:0.00 imgs/sec
2020-08-22 17:02:19,188:INFO:epoch[1], iter[10007], loss:3.18, mean_fps:6260.18 imgs/sec
2020-08-22 17:05:42,490:INFO:epoch[2], iter[15011], loss:2.621, mean_fps:6301.11 imgs/sec
2020-08-22 17:09:05,686:INFO:epoch[3], iter[20015], loss:3.113, mean_fps:6304.37 imgs/sec
2020-08-22 17:12:28,925:INFO:epoch[4], iter[25019], loss:3.29, mean_fps:6303.07 imgs/sec
2020-08-22 17:15:52,167:INFO:epoch[5], iter[30023], loss:2.865, mean_fps:6302.98 imgs/sec
...
```
## [Evaluation Process](#contents)

### Evaluation

- evaluation on Ascend

Run the command below for evaluation:

```
python eval.py --data_dir /PATH/TO/DATASET --pretrained /PATH/TO/CHECKPOINT > eval.log 2>&1 &
OR
sh scripts/run_distribute_eval.sh 8 rank_table.json /PATH/TO/DATASET /PATH/TO/CHECKPOINT
```

The above python command runs in the background. You can view the results in the file "output/202x-xx-xx_time_xx_xx_xx/202x_xxxx.log". The accuracy on the test dataset will be reported as follows:

```
2020-08-24 09:21:50,551:INFO:after allreduce eval: top1_correct=37657, tot=49920, acc=75.43%
2020-08-24 09:21:50,551:INFO:after allreduce eval: top5_correct=46224, tot=49920, acc=92.60%
```
# [Model Description](#contents)

## [Performance](#contents)

### Training accuracy results

| Parameters          | DenseNet                    |
| ------------------- | --------------------------- |
| Model Version       | DenseNet121                 |
| Resource            | Ascend 910                  |
| Uploaded Date       | 09/15/2020 (month/day/year) |
| MindSpore Version   | 1.0.0                       |
| Dataset             | ImageNet                    |
| epochs              | 120                         |
| outputs             | probability                 |
| train performance   | Top1: 75.13%; Top5: 92.57%  |

### Training performance results

| Parameters          | DenseNet                        |
| ------------------- | ------------------------------- |
| Model Version       | DenseNet121                     |
| Resource            | Ascend 910                      |
| Uploaded Date       | 09/15/2020 (month/day/year)     |
| MindSpore Version   | 1.0.0                           |
| Dataset             | ImageNet                        |
| batch_size          | 32                              |
| outputs             | probability                     |
| speed               | 1pc: 760 img/s; 8pc: 6000 img/s |
# [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the `create_dataset` function. We also use random seed in train.py.
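A minimal sketch of this kind of seeding (the exact call sites live in the dataset and training scripts; the seed value here is illustrative):

```python
import numpy as np
import mindspore.dataset as de

de.config.set_seed(1)  # make dataset shuffling reproducible
np.random.seed(1)      # seed numpy randomness, e.g. the DistributedSampler
```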
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
@@ -0,0 +1,244 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
##############test densenet example#################
python eval.py --data_dir /PATH/TO/DATASET --pretrained /PATH/TO/CHECKPOINT
"""
import os
import argparse
import datetime
import glob
import numpy as np

from mindspore import context
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.communication.management import init, get_rank, get_group_size, release
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore.common import dtype as mstype

from src.utils.logging import get_logger
from src.datasets import classification_dataset
from src.network import DenseNet121
from src.config import config

devid = int(os.getenv('DEVICE_ID'))
context.set_context(mode=context.GRAPH_MODE, device_target="Davinci",
                    save_graphs=True, device_id=devid)
class ParameterReduce(nn.Cell):
    """
    reduce parameter
    """
    def __init__(self):
        super(ParameterReduce, self).__init__()
        self.cast = P.Cast()
        self.reduce = P.AllReduce()

    def construct(self, x):
        # multiply by a float32 one so the value is cast to float32
        # before it is summed across devices with AllReduce
        one = self.cast(F.scalar_to_array(1.0), mstype.float32)
        out = x * one
        ret = self.reduce(out)
        return ret
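
# A hypothetical usage sketch (variable names are illustrative; assumes the
# communication group has been initialized with init()):
#
#     reduce_sum = ParameterReduce()
#     local_correct = Tensor(128.0, mstype.float32)  # e.g. correct predictions on this rank
#     total_correct = reduce_sum(local_correct)      # AllReduce sum over the group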
def parse_args(cloud_args=None):
    """
    parse args
    """
    parser = argparse.ArgumentParser('mindspore classification test')

    # dataset related
    parser.add_argument('--data_dir', type=str, default='', help='eval data dir')
    parser.add_argument('--num_classes', type=int, default=1000, help='num of classes in dataset')
    parser.add_argument('--image_size', type=str, default='224,224', help='image size of the dataset')
    # network related
    parser.add_argument('--backbone', default='resnet50', help='backbone')
    parser.add_argument('--pretrained', default='', type=str, help='full path of the pretrained model to load. '
                        'If it is a directory, every ckpt inside it will be tested')
    # logging related
    parser.add_argument('--log_path', type=str, default='outputs/', help='path to save log')
    parser.add_argument('--is_distributed', type=int, default=1, help='if multi device')
    parser.add_argument('--rank', type=int, default=0, help='local rank of distributed')
    parser.add_argument('--group_size', type=int, default=1, help='world size of distributed')
    # roma obs
    parser.add_argument('--train_url', type=str, default="", help='train url')

    args, _ = parser.parse_known_args()
    args = merge_args(args, cloud_args)
    args.per_batch_size = config.per_batch_size
    args.image_size = list(map(int, args.image_size.split(',')))
    return args
def get_top5_acc(top5_arg, gt_class):
    """count how many ground-truth labels appear in the top-5 predictions"""
    sub_count = 0
    for top5, gt in zip(top5_arg, gt_class):
        if gt in top5:
            sub_count += 1
    return sub_count
def merge_args(args, cloud_args):
    """
    merge args and cloud_args
    """
    args_dict = vars(args)
    if isinstance(cloud_args, dict):
        for key in cloud_args.keys():
            val = cloud_args[key]
            if key in args_dict and val:
                arg_type = type(args_dict[key])
                if arg_type is not type(None):
                    val = arg_type(val)
                args_dict[key] = val
    return args
def test(cloud_args=None):
    """
    network eval function. Get top1 and top5 accuracy for classification.
    The result will be saved at [./outputs] by default.
    """
    args = parse_args(cloud_args)

    # init distributed
    if args.is_distributed:
        init()
        args.rank = get_rank()
        args.group_size = get_group_size()

    args.outputs_dir = os.path.join(args.log_path,
                                    datetime.datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S'))
    args.logger = get_logger(args.outputs_dir, args.rank)
    args.logger.save_args(args)

    # network
    args.logger.important_info('start create network')
    if os.path.isdir(args.pretrained):
        models = list(glob.glob(os.path.join(args.pretrained, '*.ckpt')))
        # sort the checkpoints by the number embedded in the file name, newest first
        f = lambda x: -1 * int(os.path.splitext(os.path.split(x)[-1])[0].split('-')[-1].split('_')[0])
        args.models = sorted(models, key=f)
    else:
        args.models = [args.pretrained,]
    for model in args.models:
        de_dataset = classification_dataset(args.data_dir, image_size=args.image_size,
                                            per_batch_size=args.per_batch_size,
                                            max_epoch=1, rank=args.rank, group_size=args.group_size,
                                            mode='eval')
        eval_dataloader = de_dataset.create_tuple_iterator()
        network = DenseNet121(args.num_classes)

        param_dict = load_checkpoint(model)
        param_dict_new = {}
        for key, values in param_dict.items():
            if key.startswith('moments.'):
                continue
            elif key.startswith('network.'):
                param_dict_new[key[8:]] = values
            else:
                param_dict_new[key] = values
        load_param_into_net(network, param_dict_new)
        args.logger.info('load model {} success'.format(model))
        network.add_flags_recursive(fp16=True)

        img_tot = 0
        top1_correct = 0
        top5_correct = 0
        network.set_train(False)
        for data, gt_classes in eval_dataloader:
            output = network(Tensor(data, mstype.float32))
            output = output.asnumpy()
            gt_classes = gt_classes.asnumpy()

            top1_output = np.argmax(output, (-1))
            top5_output = np.argsort(output)[:, -5:]

            t1_correct = np.equal(top1_output, gt_classes).sum()
            top1_correct += t1_correct
            top5_correct += get_top5_acc(top5_output, gt_classes)
            img_tot += args.per_batch_size

        results = [[top1_correct], [top5_correct], [img_tot]]
        args.logger.info('before results={}'.format(results))
        if args.is_distributed:
            # file-based synchronization: every rank dumps its partial counts
            # to disk, waits until all ranks have done so, then sums them up
            model_md5 = model.replace('/', '')
            tmp_dir = '../cache'
            if not os.path.exists(tmp_dir):
                os.mkdir(tmp_dir)
            top1_correct_npy = '{}/top1_rank_{}_{}.npy'.format(tmp_dir, args.rank, model_md5)
            top5_correct_npy = '{}/top5_rank_{}_{}.npy'.format(tmp_dir, args.rank, model_md5)
            img_tot_npy = '{}/img_tot_rank_{}_{}.npy'.format(tmp_dir, args.rank, model_md5)
            np.save(top1_correct_npy, top1_correct)
            np.save(top5_correct_npy, top5_correct)
            np.save(img_tot_npy, img_tot)
            while True:
                rank_ok = True
                for other_rank in range(args.group_size):
                    top1_correct_npy = '{}/top1_rank_{}_{}.npy'.format(tmp_dir, other_rank, model_md5)
                    top5_correct_npy = '{}/top5_rank_{}_{}.npy'.format(tmp_dir, other_rank, model_md5)
                    img_tot_npy = '{}/img_tot_rank_{}_{}.npy'.format(tmp_dir, other_rank, model_md5)
                    if not os.path.exists(top1_correct_npy) or not os.path.exists(top5_correct_npy) \
                       or not os.path.exists(img_tot_npy):
                        rank_ok = False
                if rank_ok:
                    break

            top1_correct_all = 0
            top5_correct_all = 0
            img_tot_all = 0
            for other_rank in range(args.group_size):
                top1_correct_npy = '{}/top1_rank_{}_{}.npy'.format(tmp_dir, other_rank, model_md5)
                top5_correct_npy = '{}/top5_rank_{}_{}.npy'.format(tmp_dir, other_rank, model_md5)
                img_tot_npy = '{}/img_tot_rank_{}_{}.npy'.format(tmp_dir, other_rank, model_md5)
                top1_correct_all += np.load(top1_correct_npy)
                top5_correct_all += np.load(top5_correct_npy)
                img_tot_all += np.load(img_tot_npy)
            results = [[top1_correct_all], [top5_correct_all], [img_tot_all]]
            results = np.array(results)
        else:
            results = np.array(results)

        args.logger.info('after results={}'.format(results))
        top1_correct = results[0, 0]
        top5_correct = results[1, 0]
        img_tot = results[2, 0]
        acc1 = 100.0 * top1_correct / img_tot
        acc5 = 100.0 * top5_correct / img_tot
        args.logger.info('after allreduce eval: top1_correct={}, tot={}, acc={:.2f}%'.format(top1_correct,
                                                                                             img_tot,
                                                                                             acc1))
        args.logger.info('after allreduce eval: top5_correct={}, tot={}, acc={:.2f}%'.format(top5_correct,
                                                                                             img_tot,
                                                                                             acc5))
    if args.is_distributed:
        release()


if __name__ == "__main__":
    test()
@@ -0,0 +1,48 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
| echo "==============================================================================================================" | |||
| echo "Please run the scipt as: " | |||
| echo "sh run_distribute_eval.sh DEVICE_NUM RANK_TABLE_FILE DATASET CKPT_PATH" | |||
| echo "for example: sh run_distribute_train.sh 8 /data/hccl.json /path/to/dataset /path/to/ckpt" | |||
| echo "It is better to use absolute path." | |||
| echo "=================================================================================================================" | |||
| echo "After running the scipt, the network runs in the background. The log will be generated in eval_x/log.txt" | |||
| export RANK_SIZE=$1 | |||
| export RANK_TABLE_FILE=$2 | |||
| DATASET=$3 | |||
| CKPT_PATH=$4 | |||
| for((i=0;i<RANK_SIZE;i++)) | |||
| do | |||
| export DEVICE_ID=$i | |||
| rm -rf eval_$i | |||
| mkdir ./eval_$i | |||
| cp ./*.py ./eval_$i | |||
| cp -r ./src ./eval_$i | |||
| cd ./eval_$i || exit | |||
| export RANK_ID=$i | |||
| echo "start training for rank $i, device $DEVICE_ID" | |||
| env > env.log | |||
| python eval.py \ | |||
| --data_dir=$DATASET \ | |||
| --pretrained=$CKPT_PATH > log.txt 2>&1 & | |||
| cd ../ | |||
| done | |||
@@ -0,0 +1,45 @@
#!/bin/bash
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
| echo "==============================================================================================================" | |||
| echo "Please run the scipt as: " | |||
| echo "sh scipts/run_distribute_train.sh DEVICE_NUM RANK_TABLE_FILE DATASET" | |||
| echo "for example: sh scipts/run_distribute_train.sh 8 /data/hccl.json /path/to/dataset" | |||
| echo "It is better to use absolute path." | |||
| echo "=================================================================================================================" | |||
| echo "After running the scipt, the network runs in the background. The log will be generated in train_x/log.txt" | |||
| export RANK_SIZE=$1 | |||
| export RANK_TABLE_FILE=$2 | |||
| DATASET=$3 | |||
| for((i=0;i<RANK_SIZE;i++)) | |||
| do | |||
| export DEVICE_ID=$i | |||
| rm -rf train_$i | |||
| mkdir ./train_$i | |||
| cp ./*.py ./train_$i | |||
| cp -r ./src ./train_$i | |||
| cd ./train_$i || exit | |||
| export RANK_ID=$i | |||
| echo "start training for rank $i, device $DEVICE_ID" | |||
| env > env.log | |||
| python train.py \ | |||
| --data_dir=$DATASET > log.txt 2>&1 & | |||
| cd ../ | |||
| done | |||
@@ -0,0 +1,46 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""config"""
from easydict import EasyDict as ed

config = ed({
    "image_size": '224,224',
    "num_classes": 1000,

    "lr": 0.1,
    "lr_scheduler": 'cosine_annealing',
    "lr_epochs": '30,60,90,120',
    "lr_gamma": 0.1,
    "eta_min": 0,
    "T_max": 120,
    "max_epoch": 120,
    "per_batch_size": 32,
    "warmup_epochs": 0,

    "weight_decay": 0.0001,
    "momentum": 0.9,
    "is_dynamic_loss_scale": 0,
    "loss_scale": 1024,
    "label_smooth": 0,
    "label_smooth_factor": 0.1,

    "log_interval": 100,
    "ckpt_interval": 2000,
    "ckpt_path": 'outputs/',
    "is_save_on_master": 1,

    "rank": 0,
    "group_size": 1
})
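
# Usage sketch: scripts read these defaults directly; for example, eval.py
# sets args.per_batch_size = config.per_batch_size.
#
#     from src.config import config
#     print(config.per_batch_size)   # 32
#     print(config.lr_scheduler)     # 'cosine_annealing'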
@@ -0,0 +1,22 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
read dataset for classification
"""
from .classification import classification_dataset

__all__ = ["classification_dataset"]
@@ -0,0 +1,155 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
A function that returns a dataset for classification.
"""
import os
from PIL import Image, ImageFile
from mindspore import dtype as mstype
import mindspore.dataset as de
import mindspore.dataset.vision.c_transforms as vision_C
import mindspore.dataset.transforms.c_transforms as normal_C
from src.datasets.sampler import DistributedSampler

ImageFile.LOAD_TRUNCATED_IMAGES = True
class TxtDataset():
    """
    read a dataset from a txt file, where each line is "<image name> <label>"
    """
    def __init__(self, root, txt_name):
        super(TxtDataset, self).__init__()
        self.imgs = []
        self.labels = []
        fin = open(txt_name, "r")
        for line in fin:
            img_name, label = line.strip().split(' ')
            self.imgs.append(os.path.join(root, img_name))
            self.labels.append(int(label))
        fin.close()

    def __getitem__(self, index):
        img = Image.open(self.imgs[index]).convert('RGB')
        return img, self.labels[index]

    def __len__(self):
        return len(self.imgs)
def classification_dataset(data_dir, image_size, per_batch_size, max_epoch, rank, group_size,
                           mode='train',
                           input_mode='folder',
                           root='',
                           num_parallel_workers=None,
                           shuffle=None,
                           sampler=None,
                           class_indexing=None,
                           drop_remainder=True,
                           transform=None,
                           target_transform=None):
    """
    A function that returns a dataset for classification. The mode of the input dataset could be "folder" or "txt".
    If it is "folder", all images within one folder have the same label. If it is "txt", all paths of images
    are written into a text file.

    Args:
        data_dir (str): Path to the root directory that contains the dataset for input_mode="folder",
            or path of the text file that contains every image path for input_mode="txt".
        image_size (list): Size of the input images.
        per_batch_size (int): the batch size of every step during training.
        max_epoch (int): the number of epochs.
        rank (int): The shard ID within num_shards (default=None).
        group_size (int): Number of shards that the dataset should be divided
            into (default=None).
        mode (str): "train" or others. Default: "train".
        input_mode (str): The form of the input dataset. "folder" or "txt". Default: "folder".
        root (str): the images path for input_mode="txt". Default: "".
        num_parallel_workers (int): Number of workers to read the data. Default: None.
        shuffle (bool): Whether or not to perform shuffle on the dataset
            (default=None, performs shuffle).
        sampler (Sampler): Object used to choose samples from the dataset. Default: None.
        class_indexing (dict): A str-to-int mapping from folder name to index
            (default=None, the folder names will be sorted
            alphabetically and each class will be given a
            unique index starting from 0).

    Examples:
        >>> from src.datasets.classification import classification_dataset
        >>> # path to imagefolder directory. This directory needs to contain sub-directories which contain the images
        >>> dataset_dir = "/path/to/imagefolder_directory"
        >>> de_dataset = classification_dataset(dataset_dir, image_size=[224, 224],
        >>>                                     per_batch_size=64, max_epoch=100,
        >>>                                     rank=0, group_size=4)
        >>> # Path of the text file that contains every image path of the dataset.
        >>> dataset_dir = "/path/to/dataset/images/train.txt"
        >>> images_dir = "/path/to/dataset/images"
        >>> de_dataset = classification_dataset(dataset_dir, image_size=[224, 224],
        >>>                                     per_batch_size=64, max_epoch=100,
        >>>                                     rank=0, group_size=4,
        >>>                                     input_mode="txt", root=images_dir)
    """
    mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
    std = [0.229 * 255, 0.224 * 255, 0.225 * 255]

    if transform is None:
        if mode == 'train':
            transform_img = [
                vision_C.RandomCropDecodeResize(image_size, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
                vision_C.RandomHorizontalFlip(prob=0.5),
                vision_C.RandomColorAdjust(brightness=0.4, contrast=0.4, saturation=0.4),
                vision_C.Normalize(mean=mean, std=std),
                vision_C.HWC2CHW()
            ]
        else:
            transform_img = [
                vision_C.Decode(),
                vision_C.Resize((256, 256)),
                vision_C.CenterCrop(image_size),
                vision_C.Normalize(mean=mean, std=std),
                vision_C.HWC2CHW()
            ]
    else:
        transform_img = transform

    if target_transform is None:
        transform_label = [
            normal_C.TypeCast(mstype.int32)
        ]
    else:
        transform_label = target_transform

    if input_mode == 'folder':
        de_dataset = de.ImageFolderDataset(data_dir, num_parallel_workers=num_parallel_workers,
                                           shuffle=shuffle, sampler=sampler, class_indexing=class_indexing,
                                           num_shards=group_size, shard_id=rank)
    else:
        dataset = TxtDataset(root, data_dir)
        sampler = DistributedSampler(dataset, rank, group_size, shuffle=shuffle)
        de_dataset = de.GeneratorDataset(dataset, ["image", "label"], sampler=sampler)
        de_dataset.set_dataset_size(len(sampler))

    de_dataset = de_dataset.map(input_columns="image", num_parallel_workers=8, operations=transform_img)
    de_dataset = de_dataset.map(input_columns="label", num_parallel_workers=8, operations=transform_label)

    columns_to_project = ["image", "label"]
    de_dataset = de_dataset.project(columns=columns_to_project)

    de_dataset = de_dataset.batch(per_batch_size, drop_remainder=drop_remainder)
    de_dataset = de_dataset.repeat(1)

    return de_dataset
@@ -0,0 +1,51 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
shuffle and distribute sample
"""
import math
import numpy as np


class DistributedSampler():
    """
    function to distribute and shuffle sample
    """
    def __init__(self, dataset, rank, group_size, shuffle=True, seed=0):
        self.dataset = dataset
        self.rank = rank
        self.group_size = group_size
        self.dataset_length = len(self.dataset)
        self.num_samples = int(math.ceil(self.dataset_length * 1.0 / self.group_size))
        self.total_size = self.num_samples * self.group_size
        self.shuffle = shuffle
        self.seed = seed

    def __iter__(self):
        if self.shuffle:
            self.seed = (self.seed + 1) & 0xffffffff
            np.random.seed(self.seed)
            indices = np.random.permutation(self.dataset_length).tolist()
        else:
            indices = list(range(self.dataset_length))

        # pad the index list so it divides evenly among the group, then take
        # every group_size-th index starting from this rank
        indices += indices[:(self.total_size - len(indices))]
        indices = indices[self.rank::self.group_size]
        return iter(indices)

    def __len__(self):
        return self.num_samples
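
# Usage sketch mirroring classification.py (paths and the shard layout are
# illustrative):
#
#     from src.datasets.classification import TxtDataset
#     import mindspore.dataset as de
#     dataset = TxtDataset(root='/path/to/images', txt_name='/path/to/train.txt')
#     sampler = DistributedSampler(dataset, rank=0, group_size=8, shuffle=True)
#     de_dataset = de.GeneratorDataset(dataset, ["image", "label"], sampler=sampler)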
@@ -0,0 +1,19 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
loss function
"""
from .crossentropy import *
@@ -0,0 +1,44 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
loss function CrossEntropy
"""
from mindspore.nn.loss.loss import _Loss
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore import Tensor
from mindspore.common import dtype as mstype
import mindspore.nn as nn


class CrossEntropy(_Loss):
    """
    loss function CrossEntropy with optional label smoothing
    """
    def __init__(self, smooth_factor=0., num_classes=1000):
        super(CrossEntropy, self).__init__()
        self.onehot = P.OneHot()
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(1.0 * smooth_factor / (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits()
        self.mean = P.ReduceMean(False)

    def construct(self, logit, label):
        one_hot_label = self.onehot(label,
                                    F.shape(logit)[1], self.on_value, self.off_value)
        loss = self.ce(logit, one_hot_label)
        loss = self.mean(loss, 0)
        return loss
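
# Usage sketch with the default smoothing factor from src/config.py (0.1);
# the tensor shapes in the comments are illustrative:
#
#     loss_fn = CrossEntropy(smooth_factor=0.1, num_classes=1000)
#     # logit: Tensor of shape (batch, num_classes); label: Tensor of shape (batch,)
#     loss = loss_fn(logit, label)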
@@ -0,0 +1,19 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
learning rate scheduler
"""
from .lr_scheduler import *
@@ -0,0 +1,656 @@
# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
learning rate scheduler
"""
import math
from collections import Counter
import numpy as np

__all__ = ["LambdaLR", "MultiplicativeLR", "StepLR", "MultiStepLR", "ExponentialLR",
           "CosineAnnealingLR", "CyclicLR", "CosineAnnealingWarmRestarts", "OneCycleLR"]
class _WarmUp():
    def __init__(self, warmup_init_lr):
        self.warmup_init_lr = warmup_init_lr

    def get_lr(self):
        # Get learning rate during warmup
        raise NotImplementedError


class _LinearWarmUp(_WarmUp):
    """
    linear warmup function
    """
    def __init__(self, lr, warmup_epochs, steps_per_epoch, warmup_init_lr=0):
        self.base_lr = lr
        self.warmup_init_lr = warmup_init_lr
        self.warmup_steps = int(warmup_epochs * steps_per_epoch)
        super(_LinearWarmUp, self).__init__(warmup_init_lr)

    def get_warmup_steps(self):
        return self.warmup_steps

    def get_lr(self, current_step):
        lr_inc = (float(self.base_lr) - float(self.warmup_init_lr)) / float(self.warmup_steps)
        lr = float(self.warmup_init_lr) + lr_inc * current_step
        return lr


class _ConstWarmUp(_WarmUp):
    def get_lr(self):
        return self.warmup_init_lr


class _LRScheduler():
    def __init__(self, lr, max_epoch, steps_per_epoch):
        self.base_lr = lr
        self.steps_per_epoch = steps_per_epoch
        self.total_steps = int(max_epoch * steps_per_epoch)

    def get_lr(self):
        # Compute learning rate using chainable form of the scheduler
        raise NotImplementedError
class LambdaLR(_LRScheduler):
    """Sets the learning rate to the initial lr times a given function.

    Args:
        lr (float): Initial learning rate which is the
            lower boundary in the cycle.
        steps_per_epoch (int): The number of steps per epoch to train for. This is
            used along with epochs in order to infer the total number of steps in the cycle.
        max_epoch (int): The number of epochs to train for. This is used along
            with steps_per_epoch in order to infer the total number of steps in the cycle.
        lr_lambda (function or list): A function which computes a multiplicative
            factor given an integer parameter epoch.
        warmup_epochs (int): The number of epochs to warm up.
            Default: 0

    Example:
        >>> lambda1 = lambda epoch: epoch // 30
        >>> scheduler = LambdaLR(lr=0.1, lr_lambda=lambda1, steps_per_epoch=5000,
        >>>                      max_epoch=90, warmup_epochs=0)
        >>> lr = scheduler.get_lr()
    """

    def __init__(self, lr, lr_lambda, steps_per_epoch, max_epoch, warmup_epochs=0):
        self.lr_lambda = lr_lambda
        self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch)
        super(LambdaLR, self).__init__(lr, max_epoch, steps_per_epoch)

    def get_lr(self):
        warmup_steps = self.warmup.get_warmup_steps()
        lr_each_step = []
        for i in range(self.total_steps):
            if i < warmup_steps:
                lr = self.warmup.get_lr(i + 1)
            else:
                cur_ep = i // self.steps_per_epoch
                lr = self.base_lr * self.lr_lambda(cur_ep)
            lr_each_step.append(lr)

        return np.array(lr_each_step).astype(np.float32)
class MultiplicativeLR(_LRScheduler):
    """Multiply the learning rate by the factor given
    in the specified function.

    Args:
        lr_lambda (function or list): A function which computes a multiplicative
            factor given an integer parameter epoch.

    Example:
        >>> lmbda = lambda epoch: 0.95
        >>> scheduler = MultiplicativeLR(lr=0.1, lr_lambda=lmbda, steps_per_epoch=5000,
        >>>                              max_epoch=90, warmup_epochs=0)
        >>> lr = scheduler.get_lr()
    """

    def __init__(self, lr, lr_lambda, steps_per_epoch, max_epoch, warmup_epochs=0):
        self.lr_lambda = lr_lambda
        self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch)
        super(MultiplicativeLR, self).__init__(lr, max_epoch, steps_per_epoch)

    def get_lr(self):
        warmup_steps = self.warmup.get_warmup_steps()
        lr_each_step = []
        current_lr = self.base_lr
        for i in range(self.total_steps):
            if i < warmup_steps:
                lr = self.warmup.get_lr(i + 1)
            else:
                cur_ep = i // self.steps_per_epoch
                if i % self.steps_per_epoch == 0 and cur_ep > 0:
                    current_lr = current_lr * self.lr_lambda(cur_ep)
                lr = current_lr
            lr_each_step.append(lr)

        return np.array(lr_each_step).astype(np.float32)
class StepLR(_LRScheduler):
    """Decays the learning rate by gamma every epoch_size epochs.

    Args:
        lr (float): Initial learning rate which is the
            lower boundary in the cycle.
        steps_per_epoch (int): The number of steps per epoch to train for. This is
            used along with epochs in order to infer the total number of steps in the cycle.
        max_epoch (int): The number of epochs to train for. This is used along
            with steps_per_epoch in order to infer the total number of steps in the cycle.
        epoch_size (int): Period of learning rate decay.
        gamma (float): Multiplicative factor of learning rate decay.
            Default: 0.1.
        warmup_epochs (int): The number of epochs to warm up.
            Default: 0

    Example:
        >>> # lr = 0.1    if epoch < 30
        >>> # lr = 0.01   if 30 <= epoch < 60
        >>> # lr = 0.001  if 60 <= epoch < 90
        >>> # ...
        >>> scheduler = StepLR(lr=0.1, epoch_size=30, gamma=0.1, steps_per_epoch=5000,
        >>>                    max_epoch=90, warmup_epochs=0)
        >>> lr = scheduler.get_lr()
    """

    def __init__(self, lr, epoch_size, gamma, steps_per_epoch, max_epoch, warmup_epochs=0):
        self.epoch_size = epoch_size
        self.gamma = gamma
        self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch)
        super(StepLR, self).__init__(lr, max_epoch, steps_per_epoch)

    def get_lr(self):
        warmup_steps = self.warmup.get_warmup_steps()
        lr_each_step = []
        for i in range(self.total_steps):
            if i < warmup_steps:
                lr = self.warmup.get_lr(i + 1)
            else:
                cur_ep = i // self.steps_per_epoch
                lr = self.base_lr * self.gamma**(cur_ep // self.epoch_size)
            lr_each_step.append(lr)

        return np.array(lr_each_step).astype(np.float32)
class MultiStepLR(_LRScheduler):
    """Decays the learning rate by gamma once the number of epochs reaches one
    of the milestones.

    Args:
        lr (float): Initial learning rate which is the
            lower boundary in the cycle.
        steps_per_epoch (int): The number of steps per epoch to train for. This is
            used along with epochs in order to infer the total number of steps in the cycle.
        max_epoch (int): The number of epochs to train for. This is used along
            with steps_per_epoch in order to infer the total number of steps in the cycle.
        milestones (list): List of epoch indices. Must be increasing.
        gamma (float): Multiplicative factor of learning rate decay.
            Default: 0.1.
        warmup_epochs (int): The number of epochs to warm up.
            Default: 0

    Example:
        >>> # lr = 0.1    if epoch < 30
        >>> # lr = 0.01   if 30 <= epoch < 80
        >>> # lr = 0.001  if epoch >= 80
        >>> scheduler = MultiStepLR(lr=0.1, milestones=[30,80], gamma=0.1, steps_per_epoch=5000,
        >>>                         max_epoch=90, warmup_epochs=0)
        >>> lr = scheduler.get_lr()
    """

    def __init__(self, lr, milestones, gamma, steps_per_epoch, max_epoch, warmup_epochs=0):
        self.milestones = Counter(milestones)
        self.gamma = gamma
        self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch)
        super(MultiStepLR, self).__init__(lr, max_epoch, steps_per_epoch)

    def get_lr(self):
        warmup_steps = self.warmup.get_warmup_steps()
        lr_each_step = []
        current_lr = self.base_lr
        for i in range(self.total_steps):
            if i < warmup_steps:
                lr = self.warmup.get_lr(i + 1)
            else:
                cur_ep = i // self.steps_per_epoch
                if i % self.steps_per_epoch == 0 and cur_ep in self.milestones:
                    current_lr = current_lr * self.gamma
                lr = current_lr
            lr_each_step.append(lr)

        return np.array(lr_each_step).astype(np.float32)
class ExponentialLR(_LRScheduler):
    """Decays the learning rate by gamma every epoch.

    Args:
        lr (float): Initial learning rate which is the
            lower boundary in the cycle.
        gamma (float): Multiplicative factor of learning rate decay.
        steps_per_epoch (int): The number of steps per epoch to train for. This is
            used along with epochs in order to infer the total number of steps in the cycle.
        max_epoch (int): The number of epochs to train for. This is used along
            with steps_per_epoch in order to infer the total number of steps in the cycle.
        warmup_epochs (int): The number of epochs to warm up.
            Default: 0
    """

    def __init__(self, lr, gamma, steps_per_epoch, max_epoch, warmup_epochs=0):
        self.gamma = gamma
        self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch)
        super(ExponentialLR, self).__init__(lr, max_epoch, steps_per_epoch)

    def get_lr(self):
        warmup_steps = self.warmup.get_warmup_steps()
        lr_each_step = []
        current_lr = self.base_lr
        for i in range(self.total_steps):
            if i < warmup_steps:
                lr = self.warmup.get_lr(i + 1)
            else:
                if i % self.steps_per_epoch == 0 and i > 0:
                    current_lr = current_lr * self.gamma
                lr = current_lr
            lr_each_step.append(lr)

        return np.array(lr_each_step).astype(np.float32)
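
# Usage sketch (values illustrative): precompute a per-step learning-rate
# array and pass it to a MindSpore optimizer that accepts per-step rates.
#
#     scheduler = ExponentialLR(lr=0.1, gamma=0.9, steps_per_epoch=5000, max_epoch=90)
#     lr_each_step = scheduler.get_lr()   # numpy array, one entry per step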
class CosineAnnealingLR(_LRScheduler):
    r"""Set the learning rate using a cosine annealing schedule, where
    :math:`\eta_{max}` is set to the initial lr and :math:`T_{cur}` is the
    number of epochs since the last restart in SGDR:

    .. math::
        \begin{aligned}
            \eta_t & = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1
            + \cos\left(\frac{T_{cur}}{T_{max}}\pi\right)\right),
            & T_{cur} \neq (2k+1)T_{max}; \\
            \eta_{t+1} & = \eta_{t} + \frac{1}{2}(\eta_{max} - \eta_{min})
            \left(1 - \cos\left(\frac{1}{T_{max}}\pi\right)\right),
            & T_{cur} = (2k+1)T_{max}.
        \end{aligned}

    It has been proposed in
    `SGDR: Stochastic Gradient Descent with Warm Restarts`_. Note that this only
    implements the cosine annealing part of SGDR, and not the restarts.

    Args:
        lr (float): Initial learning rate which is the
            lower boundary in the cycle.
        T_max (int): Maximum number of iterations.
        eta_min (float): Minimum learning rate. Default: 0.
        steps_per_epoch (int): The number of steps per epoch to train for. This is
            used along with epochs in order to infer the total number of steps in the cycle.
        max_epoch (int): The number of epochs to train for. This is used along
            with steps_per_epoch in order to infer the total number of steps in the cycle.
        warmup_epochs (int): The number of epochs to warm up.
            Default: 0

    .. _SGDR\: Stochastic Gradient Descent with Warm Restarts:
        https://arxiv.org/abs/1608.03983
    """

    def __init__(self, lr, T_max, steps_per_epoch, max_epoch, warmup_epochs=0, eta_min=0):
        self.T_max = T_max
        self.eta_min = eta_min
        self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch)
        super(CosineAnnealingLR, self).__init__(lr, max_epoch, steps_per_epoch)

    def get_lr(self):
        warmup_steps = self.warmup.get_warmup_steps()
        lr_each_step = []
        current_lr = self.base_lr
        for i in range(self.total_steps):
            if i < warmup_steps:
                lr = self.warmup.get_lr(i + 1)
            else:
                cur_ep = i // self.steps_per_epoch
                if i % self.steps_per_epoch == 0 and i > 0:
                    current_lr = self.eta_min + \
                        (self.base_lr - self.eta_min) * (1. + math.cos(math.pi * cur_ep / self.T_max)) / 2
                lr = current_lr
            lr_each_step.append(lr)

        return np.array(lr_each_step).astype(np.float32)
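
# Usage sketch with this repository's defaults from src/config.py
# (lr=0.1, T_max=120, max_epoch=120, warmup_epochs=0, eta_min=0);
# steps_per_epoch is illustrative:
#
#     scheduler = CosineAnnealingLR(lr=0.1, T_max=120, steps_per_epoch=5000,
#                                   max_epoch=120, warmup_epochs=0, eta_min=0)
#     lr_each_step = scheduler.get_lr()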
| class CyclicLR(_LRScheduler): | |||
| r"""Sets the learning rate according to cyclical learning rate policy (CLR). | |||
| The policy cycles the learning rate between two boundaries with a constant | |||
| frequency, as detailed in the paper `Cyclical Learning Rates for Training | |||
| Neural Networks`_. The distance between the two boundaries can be scaled on | |||
| a per-iteration or per-cycle basis. | |||
| Cyclical learning rate policy changes the learning rate after every batch. | |||
| This class has three built-in policies, as put forth in the paper: | |||
| * "triangular": A basic triangular cycle without amplitude scaling. | |||
| * "triangular2": A basic triangular cycle that scales initial amplitude by half each cycle. | |||
| * "exp_range": A cycle that scales initial amplitude by :math:`\text{gamma}^{\text{cycle iterations}}` | |||
| at each cycle iteration. | |||
| This implementation was adapted from the github repo: `bckenstler/CLR`_ | |||
| Args: | |||
| lr (float): Initial learning rate which is the | |||
| lower boundary in the cycle. | |||
| max_lr (float): Upper learning rate boundaries in the cycle. | |||
| Functionally, it defines the cycle amplitude (max_lr - base_lr). | |||
| The lr at any cycle is the sum of base_lr and some scaling | |||
| of the amplitude; therefore max_lr may not actually be reached | |||
| depending on scaling function. | |||
| steps_per_epoch (int): The number of steps per epoch to train for. This is | |||
| used along with epochs in order to infer the total number of steps in the cycle. | |||
| max_epoch (int): The number of epochs to train for. This is used along | |||
| with steps_per_epoch in order to infer the total number of steps in the cycle. | |||
| step_size_up (int): Number of training iterations in the | |||
| increasing half of a cycle. Default: 2000 | |||
| step_size_down (int): Number of training iterations in the | |||
| decreasing half of a cycle. If step_size_down is None, | |||
| it is set to step_size_up. Default: None | |||
| mode (str): One of {triangular, triangular2, exp_range}. | |||
| Values correspond to policies detailed above. | |||
| If scale_fn is not None, this argument is ignored. | |||
| Default: 'triangular' | |||
| gamma (float): Constant in 'exp_range' scaling function: | |||
| gamma**(cycle iterations) | |||
| Default: 1.0 | |||
| scale_fn (function): Custom scaling policy defined by a single | |||
| argument lambda function, where | |||
| 0 <= scale_fn(x) <= 1 for all x >= 0. | |||
| If specified, then 'mode' is ignored. | |||
| Default: None | |||
| scale_mode (str): {'cycle', 'iterations'}. | |||
| Defines whether scale_fn is evaluated on | |||
| cycle number or cycle iterations (training | |||
| iterations since start of cycle). | |||
| Default: 'cycle' | |||
| warmup_epochs (int): The number of epochs to Warmup. | |||
| Default: 0 | |||
| .. _Cyclical Learning Rates for Training Neural Networks: https://arxiv.org/abs/1506.01186 | |||
| .. _bckenstler/CLR: https://github.com/bckenstler/CLR | |||
| """ | |||
| def __init__(self, | |||
| lr, | |||
| max_lr, | |||
| steps_per_epoch, | |||
| max_epoch, | |||
| step_size_up=2000, | |||
| step_size_down=None, | |||
| mode='triangular', | |||
| gamma=1., | |||
| scale_fn=None, | |||
| scale_mode='cycle', | |||
| warmup_epochs=0): | |||
| self.max_lr = max_lr | |||
| step_size_up = float(step_size_up) | |||
| step_size_down = float(step_size_down) if step_size_down is not None else step_size_up | |||
| self.total_size = step_size_up + step_size_down | |||
| self.step_ratio = step_size_up / self.total_size | |||
| if mode not in ['triangular', 'triangular2', 'exp_range'] \ | |||
| and scale_fn is None: | |||
| raise ValueError("mode must be one of 'triangular', 'triangular2' or 'exp_range' " | |||
| "when scale_fn is None, but got '{}'".format(mode)) | |||
| self.mode = mode | |||
| self.gamma = gamma | |||
| if scale_fn is None: | |||
| if self.mode == 'triangular': | |||
| self.scale_fn = self._triangular_scale_fn | |||
| self.scale_mode = 'cycle' | |||
| elif self.mode == 'triangular2': | |||
| self.scale_fn = self._triangular2_scale_fn | |||
| self.scale_mode = 'cycle' | |||
| elif self.mode == 'exp_range': | |||
| self.scale_fn = self._exp_range_scale_fn | |||
| self.scale_mode = 'iterations' | |||
| else: | |||
| self.scale_fn = scale_fn | |||
| self.scale_mode = scale_mode | |||
| self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch) | |||
| super(CyclicLR, self).__init__(lr, max_epoch, steps_per_epoch) | |||
| def _triangular_scale_fn(self, x): | |||
| return 1. | |||
| def _triangular2_scale_fn(self, x): | |||
| return 1 / (2. ** (x - 1)) | |||
| def _exp_range_scale_fn(self, x): | |||
| return self.gamma**x | |||
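| # Worked examples of the scale functions (illustrative): on cycle 3, | |||
| # _triangular2_scale_fn(3) == 1 / 2**2 == 0.25, so the amplitude is a quarter of | |||
| # the first cycle's; with gamma=0.99, _exp_range_scale_fn(100) ~= 0.99**100 ~= 0.366. | |||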
| def get_lr(self): | |||
| warmup_steps = self.warmup.get_warmup_steps() | |||
| lr_each_step = [] | |||
| for i in range(self.total_steps): | |||
| if i < warmup_steps: | |||
| lr = self.warmup.get_lr(i+1) | |||
| else: | |||
| # Calculates the learning rate at batch index. | |||
| cycle = math.floor(1 + i / self.total_size) | |||
| x = 1. + i / self.total_size - cycle | |||
| if x <= self.step_ratio: | |||
| scale_factor = x / self.step_ratio | |||
| else: | |||
| scale_factor = (x - 1) / (self.step_ratio - 1) | |||
| base_height = (self.max_lr - self.base_lr) * scale_factor | |||
| if self.scale_mode == 'cycle': | |||
| lr = self.base_lr + base_height * self.scale_fn(cycle) | |||
| else: | |||
| lr = self.base_lr + base_height * self.scale_fn(i) | |||
| lr_each_step.append(lr) | |||
| return np.array(lr_each_step).astype(np.float32) | |||
| class CosineAnnealingWarmRestarts(_LRScheduler): | |||
| r"""Set the learning rate using a cosine annealing schedule, where | |||
| :math:`\eta_{max}` is set to the initial lr, :math:`T_{cur}` is the | |||
| number of epochs since the last restart and :math:`T_{i}` is the number | |||
| of epochs between two warm restarts in SGDR: | |||
| .. math:: | |||
| \eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + | |||
| \cos\left(\frac{T_{cur}}{T_{i}}\pi\right)\right) | |||
| When :math:`T_{cur}=T_{i}`, set :math:`\eta_t = \eta_{min}`. | |||
| When :math:`T_{cur}=0` after restart, set :math:`\eta_t=\eta_{max}`. | |||
| It has been proposed in | |||
| `SGDR: Stochastic Gradient Descent with Warm Restarts`_. | |||
| Args: | |||
| lr (float): Initial learning rate. | |||
| steps_per_epoch (int): The number of steps per epoch to train for. This is | |||
| used along with epochs in order to infer the total number of steps in the cycle. | |||
| max_epoch (int): The number of epochs to train for. This is used along | |||
| with steps_per_epoch in order to infer the total number of steps in the cycle. | |||
| T_0 (int): Number of iterations for the first restart. | |||
| T_mult (int, optional): The factor by which :math:`T_{i}` increases after each restart. Default: 1. | |||
| eta_min (float, optional): Minimum learning rate. Default: 0. | |||
| warmup_epochs (int): The number of warmup epochs. | |||
| Default: 0 | |||
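| Examples: | |||
| Illustrative only: with T_0=10 and T_mult=2, restarts occur after | |||
| epochs 10, 30 and 70, since T_i doubles after each restart. | |||
| >>> scheduler = CosineAnnealingWarmRestarts(lr=0.1, steps_per_epoch=100, | |||
| ...                                         max_epoch=90, T_0=10, T_mult=2) | |||
| >>> lr_each_step = scheduler.get_lr() | |||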
| .. _SGDR\: Stochastic Gradient Descent with Warm Restarts: | |||
| https://arxiv.org/abs/1608.03983 | |||
| """ | |||
| def __init__(self, lr, steps_per_epoch, max_epoch, T_0, T_mult=1, eta_min=0, warmup_epochs=0): | |||
| if T_0 <= 0 or not isinstance(T_0, int): | |||
| raise ValueError("Expected positive integer T_0, but got {}".format(T_0)) | |||
| if T_mult < 1 or not isinstance(T_mult, int): | |||
| raise ValueError("Expected integer T_mult >= 1, but got {}".format(T_mult)) | |||
| self.T_0 = T_0 | |||
| self.T_i = T_0 | |||
| self.T_mult = T_mult | |||
| self.eta_min = eta_min | |||
| self.T_cur = 0 | |||
| self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch) | |||
| super(CosineAnnealingWarmRestarts, self).__init__(lr, max_epoch, steps_per_epoch) | |||
| def get_lr(self): | |||
| warmup_steps = self.warmup.get_warmup_steps() | |||
| lr_each_step = [] | |||
| for i in range(self.total_steps): | |||
| if i < warmup_steps: | |||
| lr = self.warmup.get_lr(i+1) | |||
| else: | |||
| if i % self.steps_per_epoch == 0 and i > 0: | |||
| self.T_cur += 1 | |||
| if self.T_cur >= self.T_i: | |||
| self.T_cur = self.T_cur - self.T_i | |||
| self.T_i = self.T_i * self.T_mult | |||
| lr = self.eta_min + (self.base_lr - self.eta_min) * \ | |||
| (1 + math.cos(math.pi * self.T_cur / self.T_i)) / 2 | |||
| lr_each_step.append(lr) | |||
| return np.array(lr_each_step).astype(np.float32) | |||
| class OneCycleLR(_LRScheduler): | |||
| r"""Sets the learning rate of each parameter group according to the | |||
| 1cycle learning rate policy. The 1cycle policy anneals the learning | |||
| rate from an initial learning rate to some maximum learning rate and then | |||
| from that maximum learning rate to some minimum learning rate much lower | |||
| than the initial learning rate. | |||
| This policy was initially described in the paper `Super-Convergence: | |||
| Very Fast Training of Neural Networks Using Large Learning Rates`_. | |||
| The 1cycle learning rate policy changes the learning rate after every batch. | |||
| This scheduler is not chainable. | |||
| Args: | |||
| lr (float): Initial learning rate. | |||
| steps_per_epoch (int): The number of steps per epoch to train for. This is | |||
| used along with epochs in order to infer the total number of steps in the cycle. | |||
| max_epoch (int): The number of epochs to train for. This is used along | |||
| with steps_per_epoch in order to infer the total number of steps in the cycle. | |||
| pct_start (float): The percentage of the cycle (in number of steps) spent | |||
| increasing the learning rate. | |||
| Default: 0.3 | |||
| anneal_strategy (str): {'cos', 'linear'} | |||
| Specifies the annealing strategy: "cos" for cosine annealing, "linear" for | |||
| linear annealing. | |||
| Default: 'cos' | |||
| div_factor (float): Determines the max learning rate via | |||
| max_lr = lr * div_factor | |||
| Default: 25 | |||
| final_div_factor (float): Determines the minimum learning rate via | |||
| min_lr = lr / final_div_factor | |||
| Default: 1e4 | |||
| warmup_epochs (int): The number of warmup epochs. | |||
| Default: 0 | |||
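| Examples: | |||
| Illustrative only: with lr=0.004 and the defaults, the schedule rises from | |||
| 0.004 to max_lr = 0.004 * 25 = 0.1 over the first 30% of steps, then anneals | |||
| down to min_lr = 0.004 / 1e4 = 4e-7. | |||
| >>> scheduler = OneCycleLR(lr=0.004, steps_per_epoch=100, max_epoch=10) | |||
| >>> lr_each_step = scheduler.get_lr() | |||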
| .. _Super-Convergence\: Very Fast Training of Neural Networks Using Large Learning Rates: | |||
| https://arxiv.org/abs/1708.07120 | |||
| """ | |||
| def __init__(self, | |||
| lr, | |||
| steps_per_epoch, | |||
| max_epoch, | |||
| pct_start=0.3, | |||
| anneal_strategy='cos', | |||
| div_factor=25., | |||
| final_div_factor=1e4, | |||
| warmup_epochs=0): | |||
| self.warmup = _LinearWarmUp(lr, warmup_epochs, steps_per_epoch) | |||
| super(OneCycleLR, self).__init__(lr, max_epoch, steps_per_epoch) | |||
| # Validate pct_start before deriving the two phase lengths from it | |||
| if pct_start < 0 or pct_start > 1 or not isinstance(pct_start, float): | |||
| raise ValueError("Expected float between 0 and 1 for pct_start, but got {}".format(pct_start)) | |||
| # Validate anneal_strategy | |||
| if anneal_strategy not in ['cos', 'linear']: | |||
| raise ValueError("anneal_strategy must be one of 'cos' or 'linear', instead got {}".format(anneal_strategy)) | |||
| self.step_size_up = float(pct_start * self.total_steps) - 1 | |||
| self.step_size_down = float(self.total_steps - self.step_size_up) - 1 | |||
| if anneal_strategy == 'cos': | |||
| self.anneal_func = self._annealing_cos | |||
| elif anneal_strategy == 'linear': | |||
| self.anneal_func = self._annealing_linear | |||
| # Initialize learning rate variables | |||
| self.max_lr = lr * div_factor | |||
| self.min_lr = lr / final_div_factor | |||
| def _annealing_cos(self, start, end, pct): | |||
| "Cosine anneal from `start` to `end` as pct goes from 0.0 to 1.0." | |||
| cos_out = math.cos(math.pi * pct) + 1 | |||
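| # cos_out runs from 2 (pct=0) down to 0 (pct=1), moving the lr smoothly from `start` to `end` | |||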
| return end + (start - end) / 2.0 * cos_out | |||
| def _annealing_linear(self, start, end, pct): | |||
| "Linearly anneal from `start` to `end` as pct goes from 0.0 to 1.0." | |||
| return (end - start) * pct + start | |||
| def get_lr(self): | |||
| warmup_steps = self.warmup.get_warmup_steps() | |||
| lr_each_step = [] | |||
| for i in range(self.total_steps): | |||
| if i < warmup_steps: | |||
| lr = self.warmup.get_lr(i+1) | |||
| else: | |||
| if i <= self.step_size_up: | |||
| lr = self.anneal_func(self.base_lr, self.max_lr, i / self.step_size_up) | |||
| else: | |||
| down_step_num = i - self.step_size_up | |||
| lr = self.anneal_func(self.max_lr, self.min_lr, down_step_num / self.step_size_down) | |||
| lr_each_step.append(lr) | |||
| return np.array(lr_each_step).astype(np.float32) | |||
| @@ -0,0 +1,18 @@ | |||
| # Copyright 2020 Huawei Technologies Co., Ltd | |||
| # | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | |||
| # you may not use this file except in compliance with the License. | |||
| # You may obtain a copy of the License at | |||
| # | |||
| # http://www.apache.org/licenses/LICENSE-2.0 | |||
| # | |||
| # Unless required by applicable law or agreed to in writing, software | |||
| # distributed under the License is distributed on an "AS IS" BASIS, | |||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
| # See the License for the specific language governing permissions and | |||
| # limitations under the License. | |||
| # ============================================================================ | |||
| """ | |||
| densenet network | |||
| """ | |||
| from .densenet import DenseNet121 | |||
| @@ -0,0 +1,230 @@ | |||
| # Copyright 2020 Huawei Technologies Co., Ltd | |||
| # | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | |||
| # you may not use this file except in compliance with the License. | |||
| # You may obtain a copy of the License at | |||
| # | |||
| # http://www.apache.org/licenses/LICENSE-2.0 | |||
| # | |||
| # Unless required by applicable law or agreed to in writing, software | |||
| # distributed under the License is distributed on an "AS IS" BASIS, | |||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
| # See the License for the specific language governing permissions and | |||
| # limitations under the License. | |||
| # ============================================================================ | |||
| """ | |||
| model architecture of densenet | |||
| """ | |||
| import math | |||
| from collections import OrderedDict | |||
| import mindspore.nn as nn | |||
| from mindspore.ops import operations as P | |||
| from mindspore.common import initializer as init | |||
| from src.utils.var_init import default_recurisive_init, KaimingNormal | |||
| __all__ = ["DenseNet121"] | |||
| class GlobalAvgPooling(nn.Cell): | |||
| """ | |||
| GlobalAvgPooling function. | |||
| """ | |||
| def __init__(self): | |||
| super(GlobalAvgPooling, self).__init__() | |||
| self.mean = P.ReduceMean(True) | |||
| self.shape = P.Shape() | |||
| self.reshape = P.Reshape() | |||
| def construct(self, x): | |||
| x = self.mean(x, (2, 3)) | |||
| b, c, _, _ = self.shape(x) | |||
| x = self.reshape(x, (b, c)) | |||
| return x | |||
| class CommonHead(nn.Cell): | |||
| def __init__(self, num_classes, out_channels): | |||
| super(CommonHead, self).__init__() | |||
| self.avgpool = GlobalAvgPooling() | |||
| self.fc = nn.Dense(out_channels, num_classes, has_bias=True) | |||
| def construct(self, x): | |||
| x = self.avgpool(x) | |||
| x = self.fc(x) | |||
| return x | |||
| def conv7x7(in_channels, out_channels, stride=1, padding=3, has_bias=False): | |||
| return nn.Conv2d(in_channels, out_channels, kernel_size=7, stride=stride, has_bias=has_bias, | |||
| padding=padding, pad_mode="pad") | |||
| def conv3x3(in_channels, out_channels, stride=1, padding=1, has_bias=False): | |||
| return nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=stride, has_bias=has_bias, | |||
| padding=padding, pad_mode="pad") | |||
| def conv1x1(in_channels, out_channels, stride=1, padding=0, has_bias=False): | |||
| return nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, has_bias=has_bias, | |||
| padding=padding, pad_mode="pad") | |||
| class _DenseLayer(nn.Cell): | |||
| """ | |||
| the dense layer, include 2 conv layer | |||
| """ | |||
| def __init__(self, num_input_features, growth_rate, bn_size, drop_rate): | |||
| super(_DenseLayer, self).__init__() | |||
| self.norm1 = nn.BatchNorm2d(num_input_features) | |||
| self.relu1 = nn.ReLU() | |||
| self.conv1 = conv1x1(num_input_features, bn_size*growth_rate) | |||
| self.norm2 = nn.BatchNorm2d(bn_size*growth_rate) | |||
| self.relu2 = nn.ReLU() | |||
| self.conv2 = conv3x3(bn_size*growth_rate, growth_rate) | |||
| # nn.Dropout in MindSpore takes keep_prob, unlike PyTorch's drop probability | |||
| self.keep_prob = 1.0 - drop_rate | |||
| self.dropout = nn.Dropout(keep_prob=self.keep_prob) | |||
| def construct(self, features): | |||
| bottleneck = self.conv1(self.relu1(self.norm1(features))) | |||
| new_features = self.conv2(self.relu2(self.norm2(bottleneck))) | |||
| if self.keep_prob < 1: | |||
| new_features = self.dropout(new_features) | |||
| return new_features | |||
| class _DenseBlock(nn.Cell): | |||
| """ | |||
| the dense block | |||
| """ | |||
| def __init__(self, num_layers, num_input_features, bn_size, growth_rate, drop_rate): | |||
| super(_DenseBlock, self).__init__() | |||
| self.cell_list = nn.CellList() | |||
| for i in range(num_layers): | |||
| layer = _DenseLayer( | |||
| num_input_features + i * growth_rate, | |||
| growth_rate=growth_rate, | |||
| bn_size=bn_size, | |||
| drop_rate=drop_rate | |||
| ) | |||
| self.cell_list.append(layer) | |||
| self.concate = P.Concat(axis=1) | |||
| def construct(self, init_features): | |||
| features = init_features | |||
| for layer in self.cell_list: | |||
| new_features = layer(features) | |||
| features = self.concate((features, new_features)) | |||
| return features | |||
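| # Channel bookkeeping (illustrative): with num_input_features=64, growth_rate=32 | |||
| # and num_layers=6 (the first DenseNet121 block), the concatenated output has | |||
| # 64 + 6 * 32 = 256 channels. | |||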
| class _Transition(nn.Cell): | |||
| """ | |||
| the transiton layer | |||
| """ | |||
| def __init__(self, num_input_features, num_output_features): | |||
| super(_Transition, self).__init__() | |||
| self.features = nn.SequentialCell(OrderedDict([ | |||
| ('norm', nn.BatchNorm2d(num_input_features)), | |||
| ('relu', nn.ReLU()), | |||
| ('conv', conv1x1(num_input_features, num_output_features)), | |||
| ('pool', nn.MaxPool2d(kernel_size=2, stride=2)) | |||
| ])) | |||
| def construct(self, x): | |||
| x = self.features(x) | |||
| return x | |||
| class Densenet(nn.Cell): | |||
| """ | |||
| the densenet architecture | |||
| """ | |||
| __constants__ = ['features'] | |||
| def __init__(self, growth_rate, block_config, num_init_features, bn_size=4, drop_rate=0): | |||
| super(Densenet, self).__init__() | |||
| layers = OrderedDict() | |||
| layers['conv0'] = conv7x7(3, num_init_features, stride=2, padding=3) | |||
| layers['norm0'] = nn.BatchNorm2d(num_init_features) | |||
| layers['relu0'] = nn.ReLU() | |||
| layers['pool0'] = nn.MaxPool2d(kernel_size=3, stride=2, pad_mode='same') | |||
| # Each denseblock | |||
| num_features = num_init_features | |||
| for i, num_layers in enumerate(block_config): | |||
| block = _DenseBlock( | |||
| num_layers=num_layers, | |||
| num_input_features=num_features, | |||
| bn_size=bn_size, | |||
| growth_rate=growth_rate, | |||
| drop_rate=drop_rate | |||
| ) | |||
| layers['denseblock%d'%(i+1)] = block | |||
| num_features = num_features + num_layers*growth_rate | |||
| if i != len(block_config)-1: | |||
| trans = _Transition(num_input_features=num_features, | |||
| num_output_features=num_features // 2) | |||
| layers['transition%d'%(i+1)] = trans | |||
| num_features = num_features // 2 | |||
| # Final batch norm | |||
| layers['norm5'] = nn.BatchNorm2d(num_features) | |||
| layers['relu5'] = nn.ReLU() | |||
| self.features = nn.SequentialCell(layers) | |||
| self.out_channels = num_features | |||
| def construct(self, x): | |||
| x = self.features(x) | |||
| return x | |||
| def get_out_channels(self): | |||
| return self.out_channels | |||
| def _densenet121(**kwargs): | |||
| return Densenet(growth_rate=32, block_config=(6, 12, 24, 16), num_init_features=64, **kwargs) | |||
| def _densenet161(**kwargs): | |||
| return Densenet(growth_rate=48, block_config=(6, 12, 36, 24), num_init_features=96, **kwargs) | |||
| def _densenet169(**kwargs): | |||
| return Densenet(growth_rate=32, block_config=(6, 12, 32, 32), num_init_features=64, **kwargs) | |||
| def _densenet201(**kwargs): | |||
| return Densenet(growth_rate=32, block_config=(6, 12, 48, 32), num_init_features=64, **kwargs) | |||
| class DenseNet121(nn.Cell): | |||
| """ | |||
| the densenet121 architectur | |||
| """ | |||
| def __init__(self, num_classes): | |||
| super(DenseNet121, self).__init__() | |||
| self.backbone = _densenet121() | |||
| out_channels = self.backbone.get_out_channels() | |||
| self.head = CommonHead(num_classes, out_channels) | |||
| default_recurisive_init(self) | |||
| for _, cell in self.cells_and_names(): | |||
| if isinstance(cell, nn.Conv2d): | |||
| cell.weight.set_data(init.initializer(KaimingNormal(a=math.sqrt(5), mode='fan_out', | |||
| nonlinearity='relu'), | |||
| cell.weight.shape, | |||
| cell.weight.dtype)) | |||
| elif isinstance(cell, nn.BatchNorm2d): | |||
| cell.gamma.set_data(init.initializer('ones', cell.gamma.shape)) | |||
| cell.beta.set_data(init.initializer('zeros', cell.beta.shape)) | |||
| elif isinstance(cell, nn.Dense): | |||
| cell.bias.set_data(init.initializer('zeros', cell.bias.shape)) | |||
| def construct(self, x): | |||
| x = self.backbone(x) | |||
| x = self.head(x) | |||
| return x | |||
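| # Illustrative usage (hypothetical shapes, assuming an ImageNet-style input): | |||
| #     net = DenseNet121(num_classes=1000) | |||
| #     logits = net(Tensor(np.zeros((1, 3, 224, 224), np.float32)))  # -> (1, 1000) | |||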
| @@ -0,0 +1,41 @@ | |||
| # Copyright 2020 Huawei Technologies Co., Ltd | |||
| # | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | |||
| # you may not use this file except in compliance with the License. | |||
| # You may obtain a copy of the License at | |||
| # | |||
| # http://www.apache.org/licenses/LICENSE-2.0 | |||
| # | |||
| # Unless required by applicable law or agreed to in writing, software | |||
| # distributed under the License is distributed on an "AS IS" BASIS, | |||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
| # See the License for the specific language governing permissions and | |||
| # limitations under the License. | |||
| # ============================================================================ | |||
| """ | |||
| get parameter function | |||
| """ | |||
| def get_param_groups(network): | |||
| """ | |||
| get parameter groups | |||
| """ | |||
| decay_params = [] | |||
| no_decay_params = [] | |||
| for x in network.trainable_params(): | |||
| parameter_name = x.name | |||
| if parameter_name.endswith('.bias'): | |||
| # biases are excluded from weight decay | |||
| no_decay_params.append(x) | |||
| elif parameter_name.endswith('.gamma'): | |||
| # BatchNorm gamma (scale) is excluded from weight decay | |||
| no_decay_params.append(x) | |||
| elif parameter_name.endswith('.beta'): | |||
| # BatchNorm beta (shift) is excluded from weight decay | |||
| no_decay_params.append(x) | |||
| else: | |||
| decay_params.append(x) | |||
| return [{'params': no_decay_params, 'weight_decay': 0.0}, {'params': decay_params}] | |||
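| # Illustrative usage (mirrors train.py below): the two groups let the optimizer | |||
| # apply weight decay only to conv/dense weights: | |||
| #     opt = Momentum(params=get_param_groups(network), learning_rate=lr, | |||
| #                    momentum=0.9, weight_decay=1e-4) | |||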
| @@ -0,0 +1,14 @@ | |||
| # Copyright 2020 Huawei Technologies Co., Ltd | |||
| # | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | |||
| # you may not use this file except in compliance with the License. | |||
| # You may obtain a copy of the License at | |||
| # | |||
| # http://www.apache.org/licenses/LICENSE-2.0 | |||
| # | |||
| # Unless required by applicable law or agreed to in writing, software | |||
| # distributed under the License is distributed on an "AS IS" BASIS, | |||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
| # See the License for the specific language governing permissions and | |||
| # limitations under the License. | |||
| # ============================================================================ | |||
| @@ -0,0 +1,82 @@ | |||
| # Copyright 2020 Huawei Technologies Co., Ltd | |||
| # | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | |||
| # you may not use this file except in compliance with the License. | |||
| # You may obtain a copy of the License at | |||
| # | |||
| # http://www.apache.org/licenses/LICENSE-2.0 | |||
| # | |||
| # Unless required by applicable law or agreed to in writing, software | |||
| # distributed under the License is distributed on an "AS IS" BASIS, | |||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
| # See the License for the specific language governing permissions and | |||
| # limitations under the License. | |||
| # ============================================================================ | |||
| """ | |||
| get logger. | |||
| """ | |||
| import logging | |||
| import os | |||
| import sys | |||
| from datetime import datetime | |||
| class LOGGER(logging.Logger): | |||
| """ | |||
| set up logging file. | |||
| Args: | |||
| logger_name (string): logger name. | |||
| log_dir (string): path of logger. | |||
| Returns: | |||
| string, logger path | |||
| """ | |||
| def __init__(self, logger_name, rank=0): | |||
| super(LOGGER, self).__init__(logger_name) | |||
| if rank % 8 == 0: | |||
| console = logging.StreamHandler(sys.stdout) | |||
| console.setLevel(logging.INFO) | |||
| formatter = logging.Formatter('%(asctime)s:%(levelname)s:%(message)s') | |||
| console.setFormatter(formatter) | |||
| self.addHandler(console) | |||
| def setup_logging_file(self, log_dir, rank=0): | |||
| """set up log file""" | |||
| self.rank = rank | |||
| if not os.path.exists(log_dir): | |||
| os.makedirs(log_dir, exist_ok=True) | |||
| log_name = datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S') + '_rank_{}.log'.format(rank) | |||
| self.log_fn = os.path.join(log_dir, log_name) | |||
| fh = logging.FileHandler(self.log_fn) | |||
| fh.setLevel(logging.INFO) | |||
| formatter = logging.Formatter('%(asctime)s:%(levelname)s:%(message)s') | |||
| fh.setFormatter(formatter) | |||
| self.addHandler(fh) | |||
| def info(self, msg, *args, **kwargs): | |||
| if self.isEnabledFor(logging.INFO): | |||
| self._log(logging.INFO, msg, args, **kwargs) | |||
| def save_args(self, args): | |||
| self.info('Args:') | |||
| args_dict = vars(args) | |||
| for key in args_dict.keys(): | |||
| self.info('--> %s: %s', key, args_dict[key]) | |||
| self.info('') | |||
| def important_info(self, msg, *args, **kwargs): | |||
| if self.isEnabledFor(logging.INFO) and self.rank == 0: | |||
| line_width = 2 | |||
| important_msg = '\n' | |||
| important_msg += ('*'*70 + '\n')*line_width | |||
| important_msg += ('*'*line_width + '\n')*2 | |||
| important_msg += '*'*line_width + ' '*8 + msg + '\n' | |||
| important_msg += ('*'*line_width + '\n')*2 | |||
| important_msg += ('*'*70 + '\n')*line_width | |||
| self.info(important_msg, *args, **kwargs) | |||
| def get_logger(path, rank): | |||
| logger = LOGGER("mindversion", rank) | |||
| logger.setup_logging_file(path, rank) | |||
| return logger | |||
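| # Illustrative usage: | |||
| #     logger = get_logger('./outputs', 0) | |||
| #     logger.info('train start') | |||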
| @@ -0,0 +1,204 @@ | |||
| # Copyright 2020 Huawei Technologies Co., Ltd | |||
| # | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | |||
| # you may not use this file except in compliance with the License. | |||
| # You may obtain a copy of the License at | |||
| # | |||
| # http://www.apache.org/licenses/LICENSE-2.0 | |||
| # | |||
| # Unless required by applicable law or agreed to in writing, software | |||
| # distributed under the License is distributed on an "AS IS" BASIS, | |||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
| # See the License for the specific language governing permissions and | |||
| # limitations under the License. | |||
| # ============================================================================ | |||
| """ | |||
| Initialize. | |||
| """ | |||
| import math | |||
| from functools import reduce | |||
| import numpy as np | |||
| import mindspore.nn as nn | |||
| from mindspore import Tensor | |||
| from mindspore.common import initializer as init | |||
| def _calculate_gain(nonlinearity, param=None): | |||
| r""" | |||
| Return the recommended gain value for the given nonlinearity function. | |||
| The values are as follows: | |||
| ================= ==================================================== | |||
| nonlinearity gain | |||
| ================= ==================================================== | |||
| Linear / Identity :math:`1` | |||
| Conv{1,2,3}D :math:`1` | |||
| Sigmoid :math:`1` | |||
| Tanh :math:`\frac{5}{3}` | |||
| ReLU :math:`\sqrt{2}` | |||
| Leaky Relu :math:`\sqrt{\frac{2}{1 + \text{negative\_slope}^2}}` | |||
| ================= ==================================================== | |||
| Args: | |||
| nonlinearity: the non-linear function | |||
| param: optional parameter for the non-linear function | |||
| Examples: | |||
| >>> gain = _calculate_gain('leaky_relu', 0.2) # leaky_relu with negative_slope=0.2 | |||
| """ | |||
| linear_fns = ['linear', 'conv1d', 'conv2d', 'conv3d', 'conv_transpose1d', 'conv_transpose2d', 'conv_transpose3d'] | |||
| if nonlinearity in linear_fns or nonlinearity == 'sigmoid': | |||
| return 1 | |||
| if nonlinearity == 'tanh': | |||
| return 5.0 / 3 | |||
| if nonlinearity == 'relu': | |||
| return math.sqrt(2.0) | |||
| if nonlinearity == 'leaky_relu': | |||
| if param is None: | |||
| negative_slope = 0.01 | |||
| elif (not isinstance(param, bool) and isinstance(param, int)) or isinstance(param, float): | |||
| negative_slope = param | |||
| else: | |||
| raise ValueError("negative_slope {} not a valid number".format(param)) | |||
| return math.sqrt(2.0 / (1 + negative_slope ** 2)) | |||
| raise ValueError("Unsupported nonlinearity {}".format(nonlinearity)) | |||
| def _assignment(arr, num): | |||
| """Assign the value of `num` to `arr`.""" | |||
| if arr.shape == (): | |||
| arr = arr.reshape((1,)) | |||
| arr[:] = num | |||
| arr = arr.reshape(()) | |||
| else: | |||
| if isinstance(num, np.ndarray): | |||
| arr[:] = num[:] | |||
| else: | |||
| arr[:] = num | |||
| return arr | |||
| def _calculate_in_and_out(arr): | |||
| """ | |||
| Calculate n_in and n_out. | |||
| Args: | |||
| arr (Array): Input array. | |||
| Returns: | |||
| Tuple, a tuple with two elements, the first element is `n_in` and the second element is `n_out`. | |||
| """ | |||
| dim = len(arr.shape) | |||
| if dim < 2: | |||
| raise ValueError("If initialize data with xavier uniform, the dimension of data must greater than 1.") | |||
| n_in = arr.shape[1] | |||
| n_out = arr.shape[0] | |||
| if dim > 2: | |||
| counter = reduce(lambda x, y: x * y, arr.shape[2:]) | |||
| n_in *= counter | |||
| n_out *= counter | |||
| return n_in, n_out | |||
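| # Example (illustrative): a Conv2d weight of shape (64, 3, 7, 7) gives | |||
| # n_in = 3 * 7 * 7 = 147 and n_out = 64 * 7 * 7 = 3136. | |||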
| def _select_fan(array, mode): | |||
| mode = mode.lower() | |||
| valid_modes = ['fan_in', 'fan_out'] | |||
| if mode not in valid_modes: | |||
| raise ValueError("Mode {} not supported, please use one of {}".format(mode, valid_modes)) | |||
| fan_in, fan_out = _calculate_in_and_out(array) | |||
| return fan_in if mode == 'fan_in' else fan_out | |||
| class KaimingInit(init.Initializer): | |||
| r""" | |||
| Base class. Initialize the array with the He (Kaiming) initialization algorithm. | |||
| Args: | |||
| a: the negative slope of the rectifier used after this layer (only | |||
| used with ``'leaky_relu'``) | |||
| mode: either ``'fan_in'`` (default) or ``'fan_out'``. Choosing ``'fan_in'`` | |||
| preserves the magnitude of the variance of the weights in the | |||
| forward pass. Choosing ``'fan_out'`` preserves the magnitudes in the | |||
| backwards pass. | |||
| nonlinearity: the non-linear function, recommended to use only with | |||
| ``'relu'`` or ``'leaky_relu'`` (default). | |||
| """ | |||
| def __init__(self, a=0, mode='fan_in', nonlinearity='leaky_relu'): | |||
| super(KaimingInit, self).__init__() | |||
| self.mode = mode | |||
| self.gain = _calculate_gain(nonlinearity, a) | |||
| def _initialize(self, arr): | |||
| pass | |||
| class KaimingUniform(KaimingInit): | |||
| r""" | |||
| Initialize the array with He kaiming uniform algorithm. The resulting tensor will | |||
| have values sampled from :math:`\mathcal{U}(-\text{bound}, \text{bound})` where | |||
| .. math:: | |||
| \text{bound} = \text{gain} \times \sqrt{\frac{3}{\text{fan\_mode}}} | |||
| Input: | |||
| arr (Array): The array to be assigned. | |||
| Returns: | |||
| Array, assigned array. | |||
| Examples: | |||
| >>> w = init.initializer(KaimingUniform(mode='fan_in', nonlinearity='relu'), (3, 5)) | |||
| """ | |||
| def _initialize(self, arr): | |||
| fan = _select_fan(arr, self.mode) | |||
| bound = math.sqrt(3.0) * self.gain / math.sqrt(fan) | |||
| data = np.random.uniform(-bound, bound, arr.shape) | |||
| _assignment(arr, data) | |||
| class KaimingNormal(KaimingInit): | |||
| r""" | |||
| Initialize the array with He kaiming normal algorithm. The resulting tensor will | |||
| have values sampled from :math:`\mathcal{N}(0, \text{std}^2)` where | |||
| .. math:: | |||
| \text{std} = \frac{\text{gain}}{\sqrt{\text{fan\_mode}}} | |||
| Input: | |||
| arr (Array): The array to be assigned. | |||
| Returns: | |||
| Array, assigned array. | |||
| Examples: | |||
| >>> w = init.initializer(KaimingNormal(mode='fan_out', nonlinearity='relu'), (3, 5)) | |||
| """ | |||
| def _initialize(self, arr): | |||
| fan = _select_fan(arr, self.mode) | |||
| std = self.gain / math.sqrt(fan) | |||
| data = np.random.normal(0, std, arr.shape) | |||
| _assignment(arr, data) | |||
| def default_recurisive_init(custom_cell): | |||
| """default_recurisive_init""" | |||
| for _, cell in custom_cell.cells_and_names(): | |||
| if isinstance(cell, nn.Conv2d): | |||
| cell.weight.set_data(init.initializer(KaimingUniform(a=math.sqrt(5)), cell.weight.shape, cell.weight.dtype)) | |||
| if cell.bias is not None: | |||
| fan_in, _ = _calculate_in_and_out(cell.weight.asnumpy()) | |||
| bound = 1 / math.sqrt(fan_in) | |||
| cell.bias.set_data(Tensor(np.random.uniform(-bound, bound, cell.bias.shape), cell.bias.dtype)) | |||
| elif isinstance(cell, nn.Dense): | |||
| cell.weight.set_data(init.initializer(KaimingUniform(a=math.sqrt(5)), cell.weight.shape, cell.weight.dtype)) | |||
| if cell.bias is not None: | |||
| fan_in, _ = _calculate_in_and_out(cell.weight.asnumpy()) | |||
| bound = 1 / math.sqrt(fan_in) | |||
| cell.bias.set_data(Tensor(np.random.uniform(-bound, bound, cell.bias.shape), cell.bias.dtype)) | |||
| elif isinstance(cell, (nn.BatchNorm2d, nn.BatchNorm1d)): | |||
| pass | |||
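| # Illustrative usage: call once right after building a network, as DenseNet121 | |||
| # does in its constructor: | |||
| #     net = SomeCell()          # hypothetical cell | |||
| #     default_recurisive_init(net) | |||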
| @@ -0,0 +1,290 @@ | |||
| # Copyright 2020 Huawei Technologies Co., Ltd | |||
| # | |||
| # Licensed under the Apache License, Version 2.0 (the "License"); | |||
| # you may not use this file except in compliance with the License. | |||
| # You may obtain a copy of the License at | |||
| # | |||
| # http://www.apache.org/licenses/LICENSE-2.0 | |||
| # | |||
| # Unless required by applicable law or agreed to in writing, software | |||
| # distributed under the License is distributed on an "AS IS" BASIS, | |||
| # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | |||
| # See the License for the specific language governing permissions and | |||
| # limitations under the License. | |||
| # ============================================================================ | |||
| """train launch.""" | |||
| import os | |||
| import time | |||
| import argparse | |||
| import datetime | |||
| import mindspore.nn as nn | |||
| from mindspore import Tensor | |||
| from mindspore.nn.optim import Momentum | |||
| from mindspore.communication.management import init, get_rank, get_group_size | |||
| from mindspore.train.callback import ModelCheckpoint | |||
| from mindspore.train.callback import CheckpointConfig, Callback | |||
| from mindspore.train.serialization import load_checkpoint, load_param_into_net | |||
| from mindspore.train.model import Model | |||
| from mindspore.train.loss_scale_manager import DynamicLossScaleManager, FixedLossScaleManager | |||
| from mindspore import context | |||
| from mindspore.context import ParallelMode | |||
| from mindspore.common import set_seed | |||
| from src.optimizers import get_param_groups | |||
| from src.network import DenseNet121 | |||
| from src.datasets import classification_dataset | |||
| from src.losses.crossentropy import CrossEntropy | |||
| from src.lr_scheduler import MultiStepLR, CosineAnnealingLR | |||
| from src.utils.logging import get_logger | |||
| from src.config import config | |||
| devid = int(os.getenv('DEVICE_ID')) | |||
| context.set_context(mode=context.GRAPH_MODE, enable_auto_mixed_precision=True, | |||
| device_target="Davinci", save_graphs=False, device_id=devid) | |||
| set_seed(1) | |||
| class BuildTrainNetwork(nn.Cell): | |||
| """build training network""" | |||
| def __init__(self, network, criterion): | |||
| super(BuildTrainNetwork, self).__init__() | |||
| self.network = network | |||
| self.criterion = criterion | |||
| def construct(self, input_data, label): | |||
| output = self.network(input_data) | |||
| loss = self.criterion(output, label) | |||
| return loss | |||
| class ProgressMonitor(Callback): | |||
| """monitor loss and time""" | |||
| def __init__(self, args): | |||
| super(ProgressMonitor, self).__init__() | |||
| self.me_epoch_start_time = 0 | |||
| self.me_epoch_start_step_num = 0 | |||
| self.args = args | |||
| self.ckpt_history = [] | |||
| def begin(self, run_context): | |||
| self.args.logger.info('start network train...') | |||
| def epoch_begin(self, run_context): | |||
| pass | |||
| def epoch_end(self, run_context, *me_args): | |||
| """process epoch end""" | |||
| cb_params = run_context.original_args() | |||
| me_step = cb_params.cur_step_num - 1 | |||
| real_epoch = me_step // self.args.steps_per_epoch | |||
| time_used = time.time() - self.me_epoch_start_time | |||
| fps_mean = self.args.per_batch_size * (me_step-self.me_epoch_start_step_num) * self.args.group_size / time_used | |||
| self.args.logger.info('epoch[{}], iter[{}], loss:{}, ' | |||
| 'mean_fps:{:.2f} imgs/sec'.format(real_epoch, me_step, cb_params.net_outputs, fps_mean)) | |||
| if self.args.rank_save_ckpt_flag: | |||
| import glob | |||
| ckpts = glob.glob(os.path.join(self.args.outputs_dir, '*.ckpt')) | |||
| for ckpt in ckpts: | |||
| ckpt_fn = os.path.basename(ckpt) | |||
| if not ckpt_fn.startswith('{}-'.format(self.args.rank)): | |||
| continue | |||
| if ckpt in self.ckpt_history: | |||
| continue | |||
| self.ckpt_history.append(ckpt) | |||
| self.args.logger.info('epoch[{}], iter[{}], loss:{}, ckpt:{}, ' | |||
| 'ckpt_fn:{}'.format(real_epoch, me_step, cb_params.net_outputs, ckpt, ckpt_fn)) | |||
| self.me_epoch_start_step_num = me_step | |||
| self.me_epoch_start_time = time.time() | |||
| def step_begin(self, run_context): | |||
| pass | |||
| def step_end(self, run_context, *me_args): | |||
| pass | |||
| def end(self, run_context): | |||
| self.args.logger.info('end network train...') | |||
| def parse_args(cloud_args=None): | |||
| """parameters""" | |||
| parser = argparse.ArgumentParser('mindspore classification training') | |||
| # dataset related | |||
| parser.add_argument('--data_dir', type=str, default='', help='train data dir') | |||
| # network related | |||
| parser.add_argument('--pretrained', default='', type=str, help='model_path, local pretrained model to load') | |||
| # distributed related | |||
| parser.add_argument('--is_distributed', type=int, default=1, help='if multi device') | |||
| # roma obs | |||
| parser.add_argument('--train_url', type=str, default="", help='train url') | |||
| args, _ = parser.parse_known_args() | |||
| args = merge_args(args, cloud_args) | |||
| args.image_size = config.image_size | |||
| args.num_classes = config.num_classes | |||
| args.lr = config.lr | |||
| args.lr_scheduler = config.lr_scheduler | |||
| args.lr_epochs = config.lr_epochs | |||
| args.lr_gamma = config.lr_gamma | |||
| args.eta_min = config.eta_min | |||
| args.T_max = config.T_max | |||
| args.max_epoch = config.max_epoch | |||
| args.warmup_epochs = config.warmup_epochs | |||
| args.weight_decay = config.weight_decay | |||
| args.momentum = config.momentum | |||
| args.is_dynamic_loss_scale = config.is_dynamic_loss_scale | |||
| args.loss_scale = config.loss_scale | |||
| args.label_smooth = config.label_smooth | |||
| args.label_smooth_factor = config.label_smooth_factor | |||
| args.ckpt_interval = config.ckpt_interval | |||
| args.ckpt_path = config.ckpt_path | |||
| args.is_save_on_master = config.is_save_on_master | |||
| args.rank = config.rank | |||
| args.group_size = config.group_size | |||
| args.log_interval = config.log_interval | |||
| args.per_batch_size = config.per_batch_size | |||
| args.lr_epochs = list(map(int, args.lr_epochs.split(','))) | |||
| args.image_size = list(map(int, args.image_size.split(','))) | |||
| return args | |||
| def merge_args(args, cloud_args): | |||
| """dictionary""" | |||
| args_dict = vars(args) | |||
| if isinstance(cloud_args, dict): | |||
| for key in cloud_args.keys(): | |||
| val = cloud_args[key] | |||
| if key in args_dict and val: | |||
| arg_type = type(args_dict[key]) | |||
| if arg_type is not type(None): | |||
| val = arg_type(val) | |||
| args_dict[key] = val | |||
| return args | |||
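| # Illustrative usage (hypothetical cloud_args): keys already present in the CLI | |||
| # args are overridden, with values cast to the existing argument's type: | |||
| #     args = parse_args(cloud_args={'data_dir': '/cache/imagenet', 'is_distributed': '1'}) | |||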
| def train(cloud_args=None): | |||
| """training process""" | |||
| args = parse_args(cloud_args) | |||
| # init distributed | |||
| if args.is_distributed: | |||
| init() | |||
| args.rank = get_rank() | |||
| args.group_size = get_group_size() | |||
| if args.is_dynamic_loss_scale == 1: | |||
| args.loss_scale = 1 # with dynamic loss scaling, the loss scale must not also be set in the Momentum optimizer | |||
| # select whether only the master rank saves checkpoints or every rank does; compatible with model parallel | |||
| args.rank_save_ckpt_flag = 0 | |||
| if args.is_save_on_master: | |||
| if args.rank == 0: | |||
| args.rank_save_ckpt_flag = 1 | |||
| else: | |||
| args.rank_save_ckpt_flag = 1 | |||
| # logger | |||
| args.outputs_dir = os.path.join(args.ckpt_path, | |||
| datetime.datetime.now().strftime('%Y-%m-%d_time_%H_%M_%S')) | |||
| args.logger = get_logger(args.outputs_dir, args.rank) | |||
| # dataloader | |||
| de_dataset = classification_dataset(args.data_dir, args.image_size, | |||
| args.per_batch_size, args.max_epoch, | |||
| args.rank, args.group_size) | |||
| de_dataset.map_model = 4 | |||
| args.steps_per_epoch = de_dataset.get_dataset_size() | |||
| args.logger.save_args(args) | |||
| # network | |||
| args.logger.important_info('start create network') | |||
| # get network and init | |||
| network = DenseNet121(args.num_classes) | |||
| # loss | |||
| if not args.label_smooth: | |||
| args.label_smooth_factor = 0.0 | |||
| criterion = CrossEntropy(smooth_factor=args.label_smooth_factor, | |||
| num_classes=args.num_classes) | |||
| # load pretrain model | |||
| if os.path.isfile(args.pretrained): | |||
| param_dict = load_checkpoint(args.pretrained) | |||
| param_dict_new = {} | |||
| for key, values in param_dict.items(): | |||
| if key.startswith('moments.'): | |||
| continue | |||
| elif key.startswith('network.'): | |||
| param_dict_new[key[8:]] = values | |||
| else: | |||
| param_dict_new[key] = values | |||
| load_param_into_net(network, param_dict_new) | |||
| args.logger.info('load model {} success'.format(args.pretrained)) | |||
| # lr scheduler | |||
| if args.lr_scheduler == 'exponential': | |||
| lr_scheduler = MultiStepLR(args.lr, | |||
| args.lr_epochs, | |||
| args.lr_gamma, | |||
| args.steps_per_epoch, | |||
| args.max_epoch, | |||
| warmup_epochs=args.warmup_epochs) | |||
| elif args.lr_scheduler == 'cosine_annealing': | |||
| lr_scheduler = CosineAnnealingLR(args.lr, | |||
| args.T_max, | |||
| args.steps_per_epoch, | |||
| args.max_epoch, | |||
| warmup_epochs=args.warmup_epochs, | |||
| eta_min=args.eta_min) | |||
| else: | |||
| raise NotImplementedError(args.lr_scheduler) | |||
| lr_schedule = lr_scheduler.get_lr() | |||
| # optimizer | |||
| opt = Momentum(params=get_param_groups(network), | |||
| learning_rate=Tensor(lr_schedule), | |||
| momentum=args.momentum, | |||
| weight_decay=args.weight_decay, | |||
| loss_scale=args.loss_scale) | |||
| # mixed precision training | |||
| criterion.add_flags_recursive(fp32=True) | |||
| # wrap the forward pass and loss into a single training cell | |||
| train_net = BuildTrainNetwork(network, criterion) | |||
| if args.is_distributed: | |||
| parallel_mode = ParallelMode.DATA_PARALLEL | |||
| else: | |||
| parallel_mode = ParallelMode.STAND_ALONE | |||
| if args.is_dynamic_loss_scale == 1: | |||
| loss_scale_manager = DynamicLossScaleManager(init_loss_scale=65536, scale_factor=2, scale_window=2000) | |||
| else: | |||
| loss_scale_manager = FixedLossScaleManager(args.loss_scale, drop_overflow_update=False) | |||
| context.set_auto_parallel_context(parallel_mode=parallel_mode, device_num=args.group_size, | |||
| parameter_broadcast=True, gradients_mean=True) | |||
| model = Model(train_net, optimizer=opt, metrics=None, loss_scale_manager=loss_scale_manager, amp_level="O3") | |||
| # checkpoint save | |||
| progress_cb = ProgressMonitor(args) | |||
| callbacks = [progress_cb,] | |||
| if args.rank_save_ckpt_flag: | |||
| ckpt_max_num = args.max_epoch * args.steps_per_epoch // args.ckpt_interval | |||
| ckpt_config = CheckpointConfig(save_checkpoint_steps=args.ckpt_interval, | |||
| keep_checkpoint_max=ckpt_max_num) | |||
| ckpt_cb = ModelCheckpoint(config=ckpt_config, | |||
| directory=args.outputs_dir, | |||
| prefix='{}'.format(args.rank)) | |||
| callbacks.append(ckpt_cb) | |||
| model.train(args.max_epoch, de_dataset, callbacks=callbacks) | |||
| if __name__ == "__main__": | |||
| train() | |||
| @@ -40,7 +40,7 @@ config1 = ed({ | |||
| # config for resnet50, imagenet2012 | |||
| config2 = ed({ | |||
| "class_num": 1001, | |||
| "batch_size": 32, | |||
| "batch_size": 256, | |||
| "loss_scale": 1024, | |||
| "momentum": 0.9, | |||
| "weight_decay": 1e-4, | |||
| @@ -55,7 +55,7 @@ config2 = ed({ | |||
| "use_label_smooth": True, | |||
| "label_smooth_factor": 0.1, | |||
| "lr_init": 0, | |||
| "lr_max": 0.1, | |||
| "lr_max": 0.8, | |||
| "lr_end": 0.0 | |||
| }) | |||
| @@ -292,6 +292,7 @@ train_parallel1/log:epcoh: 2 step: 97, loss is 1.7133579 | |||
| ``` | |||
| > About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html). | |||
| > **Attention** This will bind the processor cores according to `device_num` and the total number of processor cores. If you do not want to bind processor cores while pretraining, remove the `taskset` operations in `scripts/run_distribute_train.sh`. For example, on a 96-core host with 8 devices, device 0 is bound to cores 0-11 and device 7 to cores 84-95. | |||
| #### Run vgg16 on GPU | |||
| @@ -44,13 +44,20 @@ then | |||
| dataset_type=$3 | |||
| fi | |||
| export DEVICE_NUM=8 | |||
| export RANK_SIZE=8 | |||
| export RANK_TABLE_FILE=$1 | |||
| cpus=`cat /proc/cpuinfo| grep "processor"| wc -l` | |||
| avg=`expr $cpus \/ $RANK_SIZE` | |||
| gap=`expr $avg \- 1` | |||
| for((i=0;i<RANK_SIZE;i++)) | |||
| do | |||
| start=`expr $i \* $avg` | |||
| end=`expr $start \+ $gap` | |||
| cmdopt=$start"-"$end | |||
| export DEVICE_ID=$i | |||
| export RANK_ID=$i | |||
| rm -rf ./train_parallel$i | |||
| @@ -60,6 +67,6 @@ do | |||
| cd ./train_parallel$i || exit | |||
| echo "start training for rank $RANK_ID, device $DEVICE_ID, $dataset_type" | |||
| env > env.log | |||
| python train.py --data_path=$2 --device_target="Ascend" --device_id=$i --is_distributed=1 --dataset=$dataset_type &> log & | |||
| taskset -c $cmdopt python train.py --data_path=$2 --device_target="Ascend" --device_id=$i --is_distributed=1 --dataset=$dataset_type &> log & | |||
| cd .. | |||
| done | |||
| @@ -66,7 +66,7 @@ class FeedForwardNet(nn.Cell): | |||
| ) | |||
| self.get_shape = P.Shape() | |||
| self.reshape = P.Reshape() | |||
| self.dropout = nn.Dropout(keep_prob=1 - hidden_dropout_prob) | |||
| self.dropout = nn.Dropout(keep_prob=1.0 - hidden_dropout_prob) | |||
| def construct(self, input_tensor): | |||
| """ | |||
| @@ -133,7 +133,7 @@ class MultiHeadAttention(nn.Cell): | |||
| self.matmul = P.BatchMatMul() | |||
| self.softmax = nn.Softmax() | |||
| self.dropout = nn.Dropout(1 - attention_dropout_prob) | |||
| self.dropout = nn.Dropout(1.0 - attention_dropout_prob) | |||
| if self.has_attention_mask: | |||
| self.expand_dims = P.ExpandDims() | |||
| @@ -31,7 +31,7 @@ class ResidualConnection(nn.Cell): | |||
| def __init__(self, dropout_prob=0.1): | |||
| super(ResidualConnection, self).__init__() | |||
| self.add = P.TensorAdd() | |||
| self.dropout = nn.Dropout(1 - dropout_prob) | |||
| self.dropout = nn.Dropout(1.0 - dropout_prob) | |||
| def construct(self, hidden_tensor, residual): | |||
| """ | |||
| @@ -104,7 +104,7 @@ class Transformer(nn.Cell): | |||
| self.dtype = config.dtype | |||
| self.cast_compute_type = SaturateCast(dst_type=config.compute_type) | |||
| self.slice = P.StridedSlice() | |||
| self.dropout = nn.Dropout(keep_prob=1 - config.hidden_dropout_prob) | |||
| self.dropout = nn.Dropout(keep_prob=1.0 - config.hidden_dropout_prob) | |||
| self._create_attention_mask_from_input_mask = CreateAttentionMaskFromInputMask(config) | |||