@@ -1,24 +1,66 @@
# Contents

- [EfficientNet-B0 Description](#efficientnet-b0-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [EfficientNet-B0 Description](#contents)

[Paper](https://arxiv.org/abs/1905.11946): Mingxing Tan, Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. 2019.

# [Model Architecture](#contents)

The overall network architecture of EfficientNet-B0 is shown below:

[Link](https://arxiv.org/abs/1905.11946)
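
For intuition, the paper scales the B0 baseline with a single compound coefficient φ: depth, width, and input resolution grow as α^φ, β^φ, γ^φ, with α=1.2, β=1.1, γ=1.15 reported in the paper. A minimal sketch of that arithmetic (illustrative only, not code from this repository):

```python
# Illustrative sketch of the paper's compound scaling rule; not repository code.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # depth / width / resolution bases from the paper

def compound_scale(phi):
    """Return (depth, width, resolution) multipliers for compound coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

print(compound_scale(0))  # (1.0, 1.0, 1.0) -- the B0 baseline trained here
```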
# [Dataset](#contents)

Dataset used: [ImageNet](http://www.image-net.org/)

- Dataset size: ~125G, 1.28 million colorful images in 1000 classes
    - Train: 120G, 1.28 million images
    - Test: 5G, 50000 images
- Data format: RGB images
- Note: Data will be processed in src/dataset.py
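
src/dataset.py loads the data with MindSpore's dataset API, which expects the usual ImageNet layout of one subdirectory per class. A minimal sketch of the loading step (the path is illustrative; the decode, resize, and augmentation ops live in src/dataset.py):

```python
# Sketch of the loading step behind src/dataset.py; the real file adds
# decode/resize/augmentation ops. The path is illustrative.
import mindspore.dataset as ds

train_data = ds.ImageFolderDataset('/dataset/train', num_parallel_workers=8, shuffle=True)
print(train_data.get_dataset_size())  # number of training images found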
# [Environment Requirements](#contents)

- Hardware (GPU)
    - Prepare the hardware environment with a GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
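
A quick sanity check that the framework is installed and the GPU backend is selectable (a minimal sketch, not one of the repository's scripts):

```python
# Minimal environment check; not part of the repository's scripts.
import mindspore
from mindspore import context

print(mindspore.__version__)              # the results below were produced with 1.0.0
context.set_context(device_target='GPU')  # the same backend selection train.py uses
```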
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```python
.
└─efficientnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh             # launch evaluation with GPU platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
@@ -26,16 +68,16 @@ This is an example of training EfficientNet-B0 in MindSpore.
    ├─loss.py                         # customized loss function
    ├─transform_utils.py              # random augment utils
    ├─transform.py                    # random augment class
  ├─eval.py                           # evaluate the network
  └─train.py                          # train the network
```
## [Script Parameters](#contents)

Parameters for both training and evaluating can be set in config.py.

```
'random_seed': 1,           # fix random seed
'model': 'efficientnet_b0', # model name
'drop': 0.2,                # dropout rate
@@ -45,9 +87,9 @@ Parameters for both training and evaluating can be set in config.py
'batch_size': 128,          # batch size
'decay_epochs': 2.4,        # epoch interval to decay LR
'warmup_epochs': 5,         # epochs to warm up LR
'decay_rate': 0.97,         # LR decay rate
'weight_decay': 1e-5,       # weight decay
'epochs': 600,              # number of epochs to train
'workers': 8,               # number of data processing processes
'amp_level': 'O0',          # amp level
'opt': 'rmsprop',           # optimizer
@@ -62,35 +104,34 @@ Parameters for both training and evaluating can be set in config.py
'resume_start_epoch': 0,    # resume start epoch
```
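
These entries are plain Python in src/config.py; the model zoo convention is to wrap them in an EasyDict so they read as attributes. A minimal sketch of that pattern (values abridged; the name `config` is hypothetical, check src/config.py for the actual symbol):

```python
# Sketch of the config pattern; the name `config` is hypothetical --
# see src/config.py for the actual symbol and the full set of values.
from easydict import EasyDict as edict

config = edict({
    'model': 'efficientnet_b0',
    'drop': 0.2,
    'batch_size': 128,
    'epochs': 600,
})
print(config.batch_size)  # attribute access instead of config['batch_size']
```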
## [Training Process](#contents)

### Usage

```
GPU:
    # distributed training example (8p)
    sh run_distribute_train_for_gpu.sh DEVICE_NUM VISIBLE_DEVICES(0,1,2,3,4,5,6,7) DATASET_PATH
    # standalone training
    sh run_standalone_train_for_gpu.sh DEVICE_ID DATASET_PATH
```
### Launch

```bash
# distributed training example (8p) for GPU
cd scripts
sh run_distribute_train_for_gpu.sh 8 0,1,2,3,4,5,6,7 /dataset/train
# standalone training example for GPU
cd scripts
sh run_standalone_train_for_gpu.sh 0 /dataset/train
```
### Result

You can find checkpoint files together with the training results in the log.
## [Evaluation Process](#contents)

### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DATASET_PATH CHECKPOINT_PATH
```

@@ -101,11 +142,51 @@ sh run_eval_for_gpu.sh DATA_DIR DEVICE_ID PATH_CHECKPOINT

### Launch
```bash
# Evaluation with checkpoint
cd scripts
sh run_eval_for_gpu.sh /dataset/eval ./checkpoint/efficientnet_b0-600_1251.ckpt
```

> Checkpoints are produced during the training process.
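
The checkpoint name encodes progress: efficientnet_b0-600_1251.ckpt is written after epoch 600, and 1251 is the number of steps per epoch. That figure follows from the data volume and the global batch size; a quick sanity check (the 1,281,167 train-image count is the standard ImageNet-1k split, assumed here):

```python
# Sanity check of the steps-per-epoch figure encoded in the checkpoint name.
images = 1_281_167        # standard ImageNet-1k train split (assumption)
global_batch = 128 * 8    # per-device batch_size x 8 GPUs
print(images // global_batch)  # 1251 steps per epoch, with drop_remainder=True
```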
### Result

The evaluation result will be stored in the scripts path; in the log under that path you can find results like the following:

```
acc=76.96%(TOP1)
```
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance
| Parameters                 | efficientnet_b0            |
| -------------------------- | -------------------------- |
| Resource                   | NV SMX2 V100-32G           |
| Uploaded Date              | 10/26/2020                 |
| MindSpore Version          | 1.0.0                      |
| Dataset                    | ImageNet                   |
| Training Parameters        | src/config.py              |
| Optimizer                  | rmsprop                    |
| Loss Function              | LabelSmoothingCrossEntropy |
| Loss                       | 1.8886                     |
| Accuracy                   | 76.96% (TOP1)              |
| Total time                 | 132 h (8 GPUs)             |
| Checkpoint for Fine tuning | 64 M (.ckpt file)          |
### Inference Performance

| Parameters        | efficientnet_b0         |
| ----------------- | ----------------------- |
| Resource          | NV SMX2 V100-32G        |
| Uploaded Date     | 10/26/2020              |
| MindSpore Version | 1.0.0                   |
| Dataset           | ImageNet, 50,000 images |
| batch_size        | 128                     |
| outputs           | probability             |
| Accuracy          | 76.96% (TOP1)           |
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
@@ -49,7 +49,7 @@ if __name__ == '__main__':
    ckpt = load_checkpoint(args_opt.checkpoint)
    load_param_into_net(net, ckpt)
    net.set_train(False)
    val_data_url = args_opt.data_path
    dataset = create_dataset_val(cfg.batch_size, val_data_url, workers=cfg.workers, distributed=False)
    loss = LabelSmoothingCrossEntropy(smooth_factor=cfg.smoothing)
    eval_metrics = {'Loss': nn.Loss(),
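
The hunk ends mid-dict; after the metrics are assembled, evaluation goes through MindSpore's Model wrapper. A minimal sketch of the remaining step, continuing the names above (the 'Top1-Acc' entry is an assumption based on the TOP1 figure reported in this README):

```python
# Sketch of the evaluation step that follows the excerpt above; the
# 'Top1-Acc' entry is an assumption based on the reported TOP1 accuracy.
from mindspore import nn
from mindspore.train.model import Model

eval_metrics = {'Loss': nn.Loss(),
                'Top1-Acc': nn.Top1CategoricalAccuracy()}
model = Model(net, loss_fn=loss, metrics=eval_metrics)
print(model.eval(dataset))  # e.g. {'Loss': 1.8886, 'Top1-Acc': 0.7696}
```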
@@ -13,20 +13,57 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage:
          sh run_distribute_train_for_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
exit 1
fi

if [ $1 -lt 1 ] || [ $1 -gt 8 ]
then
    echo "error: DEVICE_NUM=$1 is not in (1-8)"
exit 1
fi

# check dataset file
if [ ! -d $3 ]
then
    echo "error: DATASET_PATH=$3 is not a directory"
exit 1
fi

export DEVICE_NUM=$1
export RANK_SIZE=$1
BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES="$2"

if [ $# == 3 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 > train.log 2>&1 &
fi

if [ $# == 4 ]
then
    mpirun -n $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
        python ${BASEPATH}/../train.py \
        --GPU \
        --distributed \
        --data_path $3 \
        --resume $4 > train.log 2>&1 &
fi
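
mpirun only spawns the processes; each one still has to join the NCCL communicator, which train.py does at startup before sharding the dataset by rank. A minimal sketch of that handshake (the real train.py also sets the data-parallel context):

```python
# Sketch of the per-process distributed setup train.py performs under mpirun.
from mindspore import context
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target='GPU')
init('nccl')  # one call per mpirun-spawned process
print('rank %d of %d' % (get_rank(), get_group_size()))
```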
@@ -13,15 +13,34 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ]
then
    echo "GPU: sh run_eval_for_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH]"
exit 1
fi

# check dataset file
if [ ! -d $1 ]
then
    echo "error: DATASET_PATH=$1 is not a directory"
exit 1
fi

# check checkpoint file
if [ ! -f $2 ]
then
    echo "error: CHECKPOINT_PATH=$2 is not a file"
exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../eval" ];
then
    rm -rf ../eval
fi
mkdir ../eval
cd ../eval || exit

python ${BASEPATH}/../eval.py --platform 'GPU' --data_path $1 --checkpoint=$2 > ./eval.log 2>&1 &
@@ -13,19 +13,38 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ] && [ $# != 3 ]
then
    echo "Usage:
          sh run_standalone_train_for_gpu.sh [DEVICE_ID] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
          "
exit 1
fi

# check dataset file
if [ ! -d $2 ]
then
    echo "error: DATASET_PATH=$2 is not a directory"
exit 1
fi

BASEPATH=$(cd "`dirname $0`" || exit; pwd)
export PYTHONPATH=${BASEPATH}:$PYTHONPATH
if [ -d "../train" ];
then
    rm -rf ../train
fi
mkdir ../train
cd ../train || exit

export CUDA_VISIBLE_DEVICES=$1

if [ $# == 2 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 > train.log 2>&1 &
fi

if [ $# == 3 ]
then
    python ${BASEPATH}/../train.py --GPU --data_path $2 --resume $3 > train.log 2>&1 &
fi
@@ -85,7 +85,6 @@ def create_dataset(batch_size, train_data_url='', workers=8, distributed=False):
                                      input_columns=["image", "label"],
                                      num_parallel_workers=2,
                                      drop_remainder=True)
    return ds_train
@@ -121,5 +120,4 @@ def create_dataset_val(batch_size=128, val_data_url='', workers=8, distributed=False):
    dataset = dataset.map(input_columns=["label"], operations=type_cast_op, num_parallel_workers=workers)
    dataset = dataset.map(input_columns=["image"], operations=ctrans, num_parallel_workers=workers)
    dataset = dataset.batch(batch_size, drop_remainder=True, num_parallel_workers=workers)
    return dataset
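
The dropped repeat(1) calls were no-ops: repeating a dataset once leaves it unchanged, and Model.train/Model.eval already re-iterate the pipeline per epoch. Consuming the validation pipeline directly looks like this (a sketch; the path is illustrative):

```python
# Sketch: iterate the validation pipeline defined above; the path is illustrative.
dataset = create_dataset_val(batch_size=128, val_data_url='/dataset/eval')
for batch in dataset.create_dict_iterator():
    images, labels = batch['image'], batch['label']
    break  # one batch of 128 decoded, normalized images and their labels
```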
@@ -17,7 +17,6 @@ import argparse
import math
import os
import random
import numpy as np
import mindspore
@@ -115,8 +114,6 @@ def main():
    if args.GPU:
        context.set_context(device_target='GPU')

    net = efficientnet_b0(num_classes=cfg.num_classes,
                          drop_rate=cfg.drop,
                          drop_connect_rate=cfg.drop_connect,
@@ -124,18 +121,7 @@
                          bn_tf=cfg.bn_tf,
                          )

    train_data_url = args.data_path
    train_dataset = create_dataset(
        cfg.batch_size, train_data_url, workers=cfg.workers, distributed=args.distributed)
    batches_per_epoch = train_dataset.get_dataset_size()
@@ -152,7 +138,7 @@
    config_ck = CheckpointConfig(
        save_checkpoint_steps=batches_per_epoch, keep_checkpoint_max=cfg.keep_checkpoint_max)
    ckpoint_cb = ModelCheckpoint(
        prefix=cfg.model, directory='./ckpt_' + str(rank_id) + '/', config=config_ck)
    callbacks += [ckpoint_cb]
    lr = Tensor(get_lr(base_lr=cfg.lr, total_epochs=cfg.epochs, steps_per_epoch=batches_per_epoch,
@@ -180,7 +166,7 @@
        amp_level=cfg.amp_level
    )

    # callbacks = callbacks if is_master else []

    if args.resume:
        real_epoch = cfg.epochs - cfg.resume_start_epoch
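
Resuming subtracts the epochs already finished and reloads the saved weights before training continues. A minimal sketch of that path, continuing the names above (the exact flow in train.py may differ in details):

```python
# Sketch of the resume path, continuing the excerpt above: reload the saved
# weights, then train only the remaining `real_epoch` epochs.
from mindspore.train.serialization import load_checkpoint, load_param_into_net

load_param_into_net(net, load_checkpoint(args.resume))
model.train(real_epoch, train_dataset, callbacks=callbacks)
```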
@@ -0,0 +1,130 @@
# NASNet Example

<!-- TOC -->

- [NASNet Example](#nasnet-example)
    - [Overview](#overview)
    - [Requirements](#requirements)
    - [Structure](#structure)
    - [Parameter Configuration](#parameter-configuration)
    - [Running the Example](#running-the-example)
        - [Training](#training)
            - [Usage](#usage)
            - [Launch](#launch)
            - [Result](#result)
        - [Evaluation](#evaluation)
            - [Usage](#usage-1)
            - [Launch](#launch-1)
            - [Result](#result-1)

<!-- /TOC -->
## Overview

This is an example of training NASNet-A-Mobile in MindSpore.

## Requirements

- Install [MindSpore](http://www.mindspore.cn/install/en).
- Download the dataset.
## Structure

```shell
.
└─nasnet
  ├─README.md
  ├─scripts
    ├─run_standalone_train_for_gpu.sh # launch standalone training with GPU platform (1p)
    ├─run_distribute_train_for_gpu.sh # launch distributed training with GPU platform (8p)
    └─run_eval_for_gpu.sh             # launch evaluation with GPU platform
  ├─src
    ├─config.py                       # parameter configuration
    ├─dataset.py                      # data preprocessing
    ├─loss.py                         # customized cross-entropy loss function
    ├─lr_generator.py                 # learning rate generator
    ├─nasnet_a_mobile.py              # network definition
  ├─eval.py                           # evaluate the network
  ├─export.py                         # convert the checkpoint
  └─train.py                          # train the network
```
## Parameter Configuration

Parameters for both training and evaluation can be set in config.py.

```
'random_seed': 1,             # fix random seed
'rank': 0,                    # local rank of distributed training
'group_size': 1,              # group size of distributed training
'work_nums': 8,               # number of data loading workers
'epoch_size': 500,            # total number of epochs
'keep_checkpoint_max': 100,   # max number of checkpoints to keep
'ckpt_path': './checkpoint/', # path to save checkpoints
'is_save_on_master': 1,       # save checkpoints on rank 0 (distributed parameter)
'batch_size': 32,             # input batch size
'num_classes': 1000,          # number of dataset classes
'label_smooth_factor': 0.1,   # label smoothing factor
'aux_factor': 0.4,            # loss factor of the aux logits
'lr_init': 0.04,              # initial learning rate
'lr_decay_rate': 0.97,        # learning rate decay rate
'num_epoch_per_decay': 2.4,   # number of epochs per decay
'weight_decay': 0.00004,      # weight decay
'momentum': 0.9,              # momentum
'opt_eps': 1.0,               # epsilon for the optimizer
'rmsprop_decay': 0.9,         # rmsprop decay
'loss_scale': 1,              # loss scale
```
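
The three learning-rate entries describe a stepwise exponential decay: starting from lr_init, the rate is multiplied by lr_decay_rate every num_epoch_per_decay epochs. A minimal sketch of that schedule (illustrative; src/lr_generator.py holds the actual implementation):

```python
# Sketch of the decay described above; src/lr_generator.py is authoritative.
def lr_at_epoch(epoch, lr_init=0.04, decay_rate=0.97, epochs_per_decay=2.4):
    return lr_init * decay_rate ** (epoch // epochs_per_decay)

print(lr_at_epoch(0))   # 0.04
print(lr_at_epoch(12))  # five decay steps later: ~0.0343
```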
## Running the Example

### Training

#### Usage

```
# distributed training example (8p)
sh run_distribute_train_for_gpu.sh DATA_DIR
# standalone training
sh run_standalone_train_for_gpu.sh DEVICE_ID DATA_DIR
```
#### Launch

```bash
# distributed training example (8p) for GPU
sh scripts/run_distribute_train_for_gpu.sh /dataset/train
# standalone training example for GPU
sh scripts/run_standalone_train_for_gpu.sh 0 /dataset/train
```

#### Result

You can find checkpoint files together with the training results in the log.
### Evaluation

#### Usage

```
# Evaluation
sh run_eval_for_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
```
#### Launch

```bash
# Evaluation with checkpoint
sh scripts/run_eval_for_gpu.sh 0 /dataset/val ./checkpoint/nasnet-a-mobile-rank0-248_10009.ckpt
```

> Checkpoints are produced during the training process.

#### Result

Evaluation results are stored in the scripts path. In the log under that path, you can find results like the following: