@@ -1,24 +1,66 @@
# Contents

- [EfficientNet-B0 Description](#efficientnet-b0-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [EfficientNet-B0 Description](#contents)

[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.

# [Model Architecture](#contents)

The overall network architecture of EfficientNet-B0 is shown below:

[Link](https://arxiv.org/abs/1905.11946)
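
For intuition, the paper scales the B0 baseline with a single compound coefficient φ: depth, width, and input resolution grow as α^φ, β^φ, γ^φ, with α=1.2, β=1.1, γ=1.15 reported in the paper. A minimal sketch of that arithmetic (illustrative only, not code from this repository):

```python
# Illustrative sketch of the paper's compound scaling rule; not repository code.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth / width / resolution bases from the paper

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

print(compound_scale(0))  # (1.0, 1.0, 1.0) -- the B0 baseline trained here
```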
# [Dataset](#contents)

Dataset used: [ImageNet](http://www.image-net.org/)

- Dataset size: ~125G, 1.28 million colorful images in 1000 classes
    - Train: 120G, 1.28 million images
    - Test: 5G, 50000 images
- Data format: RGB images
- Note: Data will be processed in src/dataset.py
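
src/dataset.py loads the data with MindSpore's dataset API, which expects the usual ImageNet layout of one subdirectory per class. A minimal sketch of the loading step (the path is illustrative; the decode, resize, and augmentation ops live in src/dataset.py):

```python
# Sketch of the loading step behind src/dataset.py; the real file adds
# decode/resize/augmentation ops. The path is illustrative.
import mindspore.dataset as ds

train_data = ds.ImageFolderDataset('/dataset/train', num_parallel_workers=8, shuffle=True)
print(train_data.get_dataset_size())  # number of training images found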
# [Environment Requirements](#contents)

- Hardware (GPU)
    - Prepare the hardware environment with a GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
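
A quick sanity check that the framework is installed and the GPU backend is selectable (a minimal sketch, not one of the repository's scripts):

```python
# Minimal environment check; not part of the repository's scripts.
import mindspore
from mindspore import context

print(mindspore.__version__)              # the results below were produced with 1.0.0
context.set_context(device_target='GPU')  # the same backend selection train.py uses
```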
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```python
.
└─efficientnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh             # launch evaluation with GPU platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
@@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
    ├─loss.py                         # customized loss function
    ├─transform_utils.py              # random augment utils
    ├─transform.py                    # random augment class
  ├─eval.py                           # evaluate the network
  └─train.py                          # train the network
```
## [Script Parameters](#contents)

Parameters for both training and evaluating can be set in config.py.

```
'random_seed': 1,           # fix random seed
'model': 'efficientnet_b0', # model name
'drop': 0.2,                # dropout rate
@@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
'batch_size': 128,          # batch size
'decay_epochs': 2.4,        # epoch interval to decay LR
'warmup_epochs': 5,         # epochs to warm up LR
'decay_rate': 0.97,         # LR decay rate
'weight_decay': 1e-5,       # weight decay
'epochs': 600,              # number of epochs to train
'workers': 8,               # number of data processing processes
'amp_level': 'O0',          # amp level
'opt': 'rmsprop',           # optimizer
@@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
'resume_start_epoch': 0,    # resume start epoch
```
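
These entries are plain Python in src/config.py; the model zoo convention is to wrap them in an EasyDict so they read as attributes. A minimal sketch of that pattern (values abridged; the name `config` is hypothetical, check src/config.py for the actual symbol):

```python
# Sketch of the config pattern; the name `config` is hypothetical --
# see src/config.py for the actual symbol and the full set of values.
from easydict import EasyDict as edict

config = edict({
    'model': 'efficientnet_b0',
    'drop': 0.2,
    'batch_size': 128,
    'epochs': 600,
})
print(config.batch_size)  # attribute access instead of config['batch_size']
```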
## [Training Process](#contents)

### Usage

```
GPU:
    # distributed training example (8p)
    sh run_distribute_train_for_gpu.sh DEVICE_NUM VISIBLE_DEVICES(0,1,2,3,4,5,6,7) DATASET_PATH
    # standalone training
    sh run_standalone_train_for_gpu.sh DEVICE_ID DATASET_PATH
```
### Launch

```bash
# distributed training example (8p) for GPU
cd scripts
sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
# standalone training example for GPU
cd scripts
sh run_standalone_train_for_gpu.sh 0 /dataset/train
```
### Result

You can find checkpoint files together with the training results in the log.
## [Evaluation Process](#contents)

### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DATASET_PATH CHECKPOINT_PATH
```

@@ -101,11 +142,51 @@ sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT

### Launch
```bash
# Evaluation with checkpoint
cd scripts
sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
```

> Checkpoints are produced during the training process.
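
The checkpoint name encodes progress: efficientnet_b0-600_1251.ckpt is written after epoch 600, and 1251 is the number of steps per epoch. That figure follows from the data volume and the global batch size; a quick sanity check (the 1,281,167 train-image count is the standard ImageNet-1k split, assumed here):

```python
# Sanity check of the steps-per-epoch figure encoded in the checkpoint name.
images = 1_281_167        # standard ImageNet-1k train split (assumption)
global_batch = 128 * 8    # per-device batch_size x 8 GPUs
print(images // global_batch)  # 1251 steps per epoch, with drop_remainder=True
```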
### Result

The evaluation result will be stored in the scripts path; in the log under that path you can find results like the following:

```
acc=76.96%(TOP1)
```
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance
| Parameters                 | efficientnet_b0            |
| -------------------------- | -------------------------- |
| Resource                   | NV SMX2 V100-32G           |
| Uploaded Date              | 10/26/2020                 |
| MindSpore Version          | 1.0.0                      |
| Dataset                    | ImageNet                   |
| Training Parameters        | src/config.py              |
| Optimizer                  | rmsprop                    |
| Loss Function              | LabelSmoothingCrossEntropy |
| Loss                       | 1.8886                     |
| Accuracy                   | 76.96% (TOP1)              |
| Total time                 | 132 h (8 GPUs)             |
| Checkpoint for Fine tuning | 64 M (.ckpt file)          |
### Inference Performance

| Parameters        | efficientnet_b0         |
| ----------------- | ----------------------- |
| Resource          | NV SMX2 V100-32G        |
| Uploaded Date     | 10/26/2020              |
| MindSpore Version | 1.0.0                   |
| Dataset           | ImageNet, 50,000 images |
| batch_size        | 128                     |
| outputs           | probability             |
| Accuracy          | 76.96% (TOP1)           |
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
@@ -49,7 +49,7 @@ if __name__ == '__main__':
    ckpt = load_checkpoint(args_opt.checkpoint)
    load_param_into_net(net, ckpt)
    net.set_train(False)
    val_data_url = args_opt.data_path
    dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
    loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
    eval_metrics = {'Loss': nn.Loss(),
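
The hunk ends mid-dict; after the metrics are assembled, evaluation goes through MindSpore's Model wrapper. A minimal sketch of the remaining step, continuing the names above (the 'Top1-Acc' entry is an assumption based on the TOP1 figure reported in this README):

```python
# Sketch of the evaluation step that follows the excerpt above; the
# 'Top1-Acc' entry is an assumption based on the reported TOP1 accuracy.
from mindspore import nn
from mindspore.train.model import Model

eval_metrics = {'Loss': nn.Loss(),
                'Top1-Acc': nn.Top1CategoricalAccuracy()}
model = Model(net, loss_fn=loss, metrics=eval_metrics)
print(model.eval(dataset))  # e.g. {'Loss': 1.8886, 'Top1-Acc': 0.7696}
```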
@@ -13,20 +13,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage:
          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
exit 1
fi

if [ $1 -lt 1 ] || [ $1 -gt 8 ]
then
    echo "error: DEVICE_NUM=$1 is not in (1-8)"
exit 1
fi

# check dataset file
if [ ! -d $3 ]
then
    echo "error: DATASET_PATH=$3 is not a directory"
exit 1
fi

export DEVICE_NUM=$1
export RANK_SIZE=$1
BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES="$2"

if [ $# == 3 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 > train.log 2>&1 &
fi

if [ $# == 4 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 \
        --resume $4 > train.log 2>&1 &
fi
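
mpirun only spawns the processes; each one still has to join the NCCL communicator, which train.py does at startup before sharding the dataset by rank. A minimal sketch of that handshake (the real train.py also sets the data-parallel context):

```python
# Sketch of the per-process distributed setup train.py performs under mpirun.
from mindspore import context
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')
init('nccl')  # one call per mpirun-spawned process
print('rank %d of %d' % (get_rank(), get_group_size()))
```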
@@ -13,15 +13,34 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ]
then
    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
exit 1
fi

# check dataset file
if [ ! -d $1 ]
then
    echo "error: DATASET_PATH=$1 is not a directory"
exit 1
fi

# check checkpoint file
if [ ! -f $2 ]
then
    echo "error: CHECKPOINT_PATH=$2 is not a file"
exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../eval" ];
then
    rm -rf ../eval
fi
mkdir ../eval
cd ../eval || exit

python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &
@@ -13,19 +13,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ] && [ $# != 3 ]
then
    echo "Usage:
          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
exit 1
fi

# check dataset file
if [ ! -d $2 ]
then
    echo "error: DATASET_PATH=$2 is not a directory"
exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES=$1

if [ $# == 2 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
fi

if [ $# == 3 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
fi
@@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
                                      input_columns=["image", "label"],
                                      num_parallel_workers=2,
                                      drop_remainder=True)
    return ds_train
@@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=False):
    dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
    dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
    dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
    return dataset
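
The dropped repeat(1) calls were no-ops: repeating a dataset once leaves it unchanged, and Model.train/Model.eval already re-iterate the pipeline per epoch. Consuming the validation pipeline directly looks like this (a sketch; the path is illustrative):

```python
# Sketch: iterate the validation pipeline defined above; the path is illustrative.
dataset = create_dataset_val(batch_size=128, val_data_url='/dataset/eval')
for batch in dataset.create_dict_iterator():
    images, labels = batch['image'], batch['label']
    break  # one batch of 128 decoded, normalized images and their labels
```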
@@ -17,7 +17,6 @@ import argparse
import math
import os
import random
import numpy as np
import mindspore
@@ -115,8 +114,6 @@ def main():
    if args.GPU:
        context.set_context(device_target='GPU')

    net = efficientnet_b0(num_classes=cfg.num_classes,
                          drop_rate=cfg.drop,
                          drop_connect_rate=cfg.drop_connect,
@@ -124,18 +121,7 @@
                          bn_tf=cfg.bn_tf,
                          )

    train_data_url = args.data_path
    train_dataset = create_dataset(
        cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
    batches_per_epoch = train_dataset.get_dataset_size()
@@ -152,7 +138,7 @@
    config_ck = CheckpointConfig(
        save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
    ckpoint_cb = ModelCheckpoint(
        prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)
    callbacks += [ckpoint_cb]
    lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,
@@ -180,7 +166,7 @@
        amp_level=cfg.amp_level
    )

    # callbacks = callbacks if is_master else []

    if args.resume:
        real_epoch = cfg.epochs - cfg.resume_start_epoch
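
Resuming subtracts the epochs already finished and reloads the saved weights before training continues. A minimal sketch of that path, continuing the names above (the exact flow in train.py may differ in details):

```python
# Sketch of the resume path, continuing the excerpt above: reload the saved
# weights, then train only the remaining `real_epoch` epochs.
from mindspore.train.serialization import load_checkpoint, load_param_into_net

load_param_into_net(net, load_checkpoint(args.resume))
model.train(real_epoch, train_dataset, callbacks=callbacks)
```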
@@ -0,0 +1,130 @@
# NASNet Example

<!-- TOC -->

- [NASNet Example](#nasnet-example)
    - [Overview](#overview)
    - [Requirements](#requirements)
    - [Structure](#structure)
    - [Parameter Configuration](#parameter-configuration)
    - [Running the Example](#running-the-example)
        - [Training](#training)
            - [Usage](#usage)
            - [Launch](#launch)
            - [Result](#result)
        - [Evaluation](#evaluation)
            - [Usage](#usage-1)
            - [Launch](#launch-1)
            - [Result](#result-1)

<!-- /TOC -->
## Overview

This is an example of training NASNet-A-Mobile in MindSpore.

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.
## Structure

```shell
.
└─nasnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh             # launch evaluation with GPU platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
    ├─loss.py                         # customized cross-entropy loss function
    ├─lr_generator.py                 # learning rate generator
    ├─nasnet_a_mobile.py              # network definition
  ├─eval.py                           # evaluate the network
  ├─export.py                         # convert the checkpoint
  └─train.py                          # train the network
```
## Parameter Configuration

Parameters for both training and evaluation can be set in config.py.

```
'random_seed': 1,             # fix random seed
'rank': 0,                    # local rank of distributed training
'group_size': 1,              # group size of distributed training
'work_nums': 8,               # number of data loading workers
'epoch_size': 500,            # total number of epochs
'keep_checkpoint_max': 100,   # max number of checkpoints to keep
'ckpt_path': './checkpoint/', # path to save checkpoints
'is_save_on_master': 1,       # save checkpoints on rank 0 (distributed parameter)
'batch_size': 32,             # input batch size
'num_classes': 1000,          # number of dataset classes
'label_smooth_factor': 0.1,   # label smoothing factor
'aux_factor': 0.4,            # loss factor of the aux logits
'lr_init': 0.04,              # initial learning rate
'lr_decay_rate': 0.97,        # learning rate decay rate
'num_epoch_per_decay': 2.4,   # number of epochs per decay
'weight_decay': 0.00004,      # weight decay
'momentum': 0.9,              # momentum
'opt_eps': 1.0,               # epsilon for the optimizer
'rmsprop_decay': 0.9,         # rmsprop decay
'loss_scale': 1,              # loss scale
```
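
The three learning-rate entries describe a stepwise exponential decay: starting from lr_init, the rate is multiplied by lr_decay_rate every num_epoch_per_decay epochs. A minimal sketch of that schedule (illustrative; src/lr_generator.py holds the actual implementation):

```python
# Sketch of the decay described above; src/lr_generator.py is authoritative.
def lr_at_epoch(epoch, lr_init=0.04, decay_rate=0.97, epochs_per_decay=2.4):
    return lr_init * decay_rate ** (epoch // epochs_per_decay)

print(lr_at_epoch(0))   # 0.04
print(lr_at_epoch(12))  # five decay steps later: ~0.0343
```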
## Running the Example

### Training

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```
#### Launch

```bash
# distributed training example (8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```

#### Result

You can find checkpoint files together with the training results in the log.
### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```
#### Launch

```bash
# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
```

> Checkpoints are produced during the training process.

#### Result

Evaluation results are stored in the scripts path. In the log under that path, you can find results like the following: