mobilenetv2+ssd gpu

5 years ago · 0d16d52d61
--- a/model_zoo/official/cv/ssd/README.md
+++ b/model_zoo/official/cv/ssd/README.md
@@ -82,7 +82,8 @@ Dataset used: [COCO2017](<http://images.cocodataset.org/>)

 # [Quick Start](#contents)

 After installing MindSpore via the official website, you can start training and evaluation on Ascend as follows: 
 After installing MindSpore via the official website, you can start training and evaluation as follows: 
 - running on Ascend

 ```
 # distributed training on Ascend
@@ -91,6 +92,14 @@ sh run_distribute_train.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] [RANK_TABLE_
 # run eval on Ascend
 sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
 ```
 - running on GPU
 ```
 # distributed training on GPU
 sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET]

 # run eval on GPU
 sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
 ```

 # [Script Description](#contents)

@@ -100,22 +109,24 @@ sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
 .
 └─ cv
  └─ ssd      
    ├─ README.md                  ## descriptions about SSD
    ├─ README.md                      ## descriptions about SSD
    ├─ scripts
      └─ run_distribute_train.sh  ## shell script for distributed on ascend
      └─ run_eval.sh              ## shell script for eval on ascend
       ├─ run_distribute_train.sh      ## shell script for distributed training on ascend
       ├─ run_distribute_train_gpu.sh  ## shell script for distributed training on gpu
      ├─ run_eval.sh                  ## shell script for eval on ascend
      └─ run_eval_gpu.sh              ## shell script for eval on gpu
    ├─ src
      ├─ __init__.py              ## init file
      ├─ box_util.py              ## bbox utils
      ├─ coco_eval.py             ## coco metrics utils
      ├─ config.py                ## total config
      ├─ dataset.py               ## create dataset and process dataset
      ├─ init_params.py           ## parameters utils
      ├─ lr_schedule.py           ## learning ratio generator
      └─ ssd.py                   ## ssd architecture
    ├─ eval.py                    ## eval scripts
    ├─ train.py                   ## train scripts
    ├── mindspore_hub_conf.py       #  mindspore hub interface
      ├─ __init__.py                  ## init file
      ├─ box_util.py                  ## bbox utils
      ├─ coco_eval.py                 ## coco metrics utils
      ├─ config.py                    ## total config
      ├─ dataset.py                   ## create dataset and process dataset
      ├─ init_params.py               ## parameters utils
       ├─ lr_schedule.py               ## learning rate generator
      └─ ssd.py                       ## ssd architecture
    ├─ eval.py                        ## eval scripts
    ├─ train.py                       ## train scripts
    └─ mindspore_hub_conf.py          ## mindspore hub interface
 ```

 ## [Script Parameters](#contents)
@@ -146,10 +157,9 @@ sh run_eval.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]

 ## [Training Process](#contents)

 ### Training on Ascend

 To train the model, run `train.py`. If `mindrecord_dir` is empty, it generates [mindrecord](https://www.mindspore.cn/tutorial/training/zh-CN/master/advanced_use/convert_dataset.html) files from `coco_root` (COCO dataset) or from `image_dir` and `anno_path` (your own dataset). **Note: if `mindrecord_dir` isn't empty, the existing mindrecord files are used instead of the raw images.**

 ### Training on Ascend

 - Distribute mode

@@ -184,6 +194,34 @@ epoch: 500 step: 458, loss is 0.5548882
 epoch time: 39064.8467540741, per step time: 85.29442522723602
 ```

 ### Training on GPU

 - Distribute mode

 ```
    sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] [PRE_TRAINED](optional) [PRE_TRAINED_EPOCH_SIZE](optional)
 ```
 This script takes four or six arguments:
 - `DEVICE_NUM`: the number of devices for distributed training.
 - `EPOCH_SIZE`: the number of epochs for distributed training.
 - `LR`: the initial learning rate for distributed training.
 - `DATASET`: the dataset mode for distributed training.
 - `PRE_TRAINED` (optional): the path of the pretrained checkpoint file; an absolute path is recommended.
 - `PRE_TRAINED_EPOCH_SIZE` (optional): the number of epochs the pretrained checkpoint was trained for.
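
How the optional arguments change the launch command can be sketched in Python. This is an illustration only, mirroring the `mpirun` invocation in `run_distribute_train_gpu.sh`; the helper name `build_train_command` is hypothetical and not part of the repository.

```python
from typing import List


def build_train_command(args: List[str]) -> List[str]:
    """Build the mpirun command line the GPU launch script would run.

    Hypothetical helper for illustration; it mirrors the shell script's logic.
    """
    # The script accepts either the 4 required arguments or all 6.
    if len(args) not in (4, 6):
        raise ValueError("expected 4 or 6 arguments, got %d" % len(args))

    device_num, epoch_size, lr, dataset = args[:4]
    cmd = [
        "mpirun", "-allow-run-as-root", "-n", device_num,
        "--output-filename", "log_output", "--merge-stderr-to-stdout",
        "python", "train.py",
        "--distribute=True",
        "--lr=" + lr,
        "--dataset=" + dataset,
        "--device_num=" + device_num,
    ]
    # The pretrained checkpoint and its epoch count are passed only together.
    if len(args) == 6:
        pre_trained, pre_trained_epoch_size = args[4:]
        cmd += [
            "--pre_trained=" + pre_trained,
            "--pre_trained_epoch_size=" + pre_trained_epoch_size,
        ]
    cmd += ["--loss_scale=1", "--run_platform=GPU", "--epoch_size=" + epoch_size]
    return cmd


print(" ".join(build_train_command(["8", "500", "0.2", "coco"])))
```

Passing five arguments raises `ValueError`, which corresponds to the usage message the shell script prints before exiting.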

    Training results are stored in the "LOG" folder under the current path. There you can find the checkpoint files together with log output like the following:

 ```
 epoch: 1 step: 1, loss is 420.11783
 epoch: 1 step: 2, loss is 434.11032
 epoch: 1 step: 3, loss is 476.802
 ...
 epoch: 1 step: 458, loss is 3.1283689
 epoch time: 150753.701, per step time: 329.157
 ...

 ```

 ## [Evaluation Process](#contents)

 ### Evaluation on Ascend
@@ -219,41 +257,73 @@ Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.697
 mAP: 0.23808886505483504
 ```

 ### Evaluation on GPU

 ```
 sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
 ```
 This script takes three arguments:
 - `DATASET`: the dataset mode of the evaluation dataset.
 - `CHECKPOINT_PATH`: the absolute path of the checkpoint file.
 - `DEVICE_ID`: the device id for evaluation.

 > The checkpoint is produced during training.
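
The way `run_eval_gpu.sh` normalizes `CHECKPOINT_PATH` before checking it can be approximated in Python. This is a minimal sketch, assuming that `os.path.realpath` is an acceptable stand-in for `realpath -m`; the Python function is for illustration and is not part of the repository.

```python
import os


def get_real_path(path: str) -> str:
    """Return absolute paths unchanged; resolve relative ones against the cwd."""
    if path.startswith("/"):
        return path
    # os.path.realpath does not require the target to exist by default.
    return os.path.realpath(os.path.join(os.getcwd(), path))


checkpoint_path = get_real_path("ssd-500_458.ckpt")  # hypothetical file name
if not os.path.isfile(checkpoint_path):
    print("error: CHECKPOINT_PATH=%s is not a file" % checkpoint_path)
```

A relative path is therefore accepted as well, but it is resolved against the directory the script is launched from, not against the `eval` folder the script later changes into.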

 Inference results are stored in a folder under the example path whose name begins with "eval". There you can find results like the following in the log.

 ```
 Average Precision (AP) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.224
 Average Precision (AP) @[ IoU=0.50 | area= all | maxDets=100 ] = 0.375
 Average Precision (AP) @[ IoU=0.75 | area= all | maxDets=100 ] = 0.228
 Average Precision (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.034
 Average Precision (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.189
 Average Precision (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.407
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 1 ] = 0.243
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets= 10 ] = 0.382
 Average Recall (AR) @[ IoU=0.50:0.95 | area= all | maxDets=100 ] = 0.417
 Average Recall (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.120
 Average Recall (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.425
 Average Recall (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.686

 ========================================

 mAP: 0.2244936111705981
 ```

 # [Model Description](#contents)
 ## [Performance](#contents)

 ### Evaluation Performance

 | Parameters                 | Ascend                                                       |
 | -------------------------- | -------------------------------------------------------------|
 | Model Version              | SSD V1                                                       |
 | Resource                   | Ascend 910 ；CPU 2.60GHz，192cores；Memory，755G              |
 | uploaded Date              | 06/01/2020 (month/day/year)                                  |
 | MindSpore Version          | 0.3.0-alpha                                                  |
 | Dataset                    | COCO2017                                                     |
 | Training Parameters        | epoch = 500,  batch_size = 32                                |
 | Optimizer                  | Momentum                                                     |
 | Loss Function              | Sigmoid Cross Entropy,SmoothL1Loss                           |
 | Speed                      | 8pcs: 90ms/step                                              |
 | Total time                 | 8pcs: 4.81hours                                              |
 | Parameters (M)             | 34                                                           |
 | Scripts                    | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd |
 | Parameters                 | Ascend                                                       | GPU                                                          |
 | -------------------------- | -------------------------------------------------------------| -------------------------------------------------------------|
 | Model Version              | SSD V1                                                       | SSD V1                                                       |
 | Resource                   | Ascend 910 ；CPU 2.60GHz，192cores；Memory，755G             | NV SXM2 V100-16G                                             |
 | uploaded Date              | 06/01/2020 (month/day/year)                                  | 09/24/2020 (month/day/year)                                  |
 | MindSpore Version          | 0.3.0-alpha                                                  | 1.0.0                                                        |
 | Dataset                    | COCO2017                                                     | COCO2017                                                     |
 | Training Parameters        | epoch = 500,  batch_size = 32                                | epoch = 800,  batch_size = 32                                |
 | Optimizer                  | Momentum                                                     | Momentum                                                     |
 | Loss Function              | Sigmoid Cross Entropy,SmoothL1Loss                           | Sigmoid Cross Entropy,SmoothL1Loss                           |
 | Speed                      | 8pcs: 90ms/step                                              | 8pcs: 121ms/step                                             |
 | Total time                 | 8pcs: 4.81hours                                              | 8pcs: 12.31hours                                              |
 | Parameters (M)             | 34                                                           | 34                                                           |
 | Scripts                    | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/ssd |
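
The GPU total-time figure can be cross-checked against the per-step speed. The sketch below assumes 458 steps per epoch, the last step index shown in the training logs; the function name is chosen for illustration.

```python
def total_hours(epochs: int, steps_per_epoch: int, ms_per_step: float) -> float:
    """Estimate wall-clock training time in hours from the per-step speed."""
    return epochs * steps_per_epoch * ms_per_step / 1000 / 3600


# GPU column: epoch = 800, 121 ms/step, 458 steps per epoch (assumed from the logs).
gpu_hours = total_hours(800, 458, 121)
print("estimated GPU training time: %.1f hours" % gpu_hours)
```

The estimate comes out at roughly 12.3 hours, in line with the 12.31 hours reported in the table.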


 ### Inference Performance

 | Parameters          | Ascend                      |
 | ------------------- | ----------------------------|
 | Model Version       | SSD V1                      |
 | Resource            | Ascend 910                  |
 | Uploaded Date       | 06/01/2020 (month/day/year) |
 | MindSpore Version   | 0.3.0-alpha                 |
 | Dataset             | COCO2017                    |
 | batch_size          | 1                           |
 | outputs             | mAP                         |
 | Accuracy            | IoU=0.50:0.95: 23.8%        |
 | Model for inference | 34M(.ckpt file)             |
 | Parameters          | Ascend                      | GPU                         |
 | ------------------- | ----------------------------| ----------------------------|
 | Model Version       | SSD V1                      | SSD V1                      |
 | Resource            | Ascend 910                  | GPU                         |
 | Uploaded Date       | 06/01/2020 (month/day/year) | 09/24/2020 (month/day/year) |
 | MindSpore Version   | 0.3.0-alpha                 | 1.0.0                       |
 | Dataset             | COCO2017                    | COCO2017                    |
 | batch_size          | 1                           | 1                           |
 | outputs             | mAP                         | mAP                         |
 | Accuracy            | IoU=0.50:0.95: 23.8%        | IoU=0.50:0.95: 22.4%        |
 | Model for inference | 34M(.ckpt file)             | 34M(.ckpt file)             |

 # [Description of Random Situation](#contents)

--- a/model_zoo/official/cv/ssd/eval.py
+++ b/model_zoo/official/cv/ssd/eval.py
@@ -71,9 +71,11 @@ if __name__ == '__main__':
    parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
    parser.add_argument("--dataset", type=str, default="coco", help="Dataset, default is coco.")
    parser.add_argument("--checkpoint_path", type=str, required=True, help="Checkpoint file path.")
    parser.add_argument("--run_platform", type=str, default="Ascend", choices=("Ascend", "GPU"),
                        help="run platform, only support Ascend and GPU.")
    args_opt = parser.parse_args()

    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
    context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.run_platform, device_id=args_opt.device_id)

    prefix = "ssd_eval.mindrecord"
    mindrecord_dir = config.mindrecord_dir
--- a/model_zoo/official/cv/ssd/scripts/run_distribute_train_gpu.sh
+++ b/model_zoo/official/cv/ssd/scripts/run_distribute_train_gpu.sh
@@ -0,0 +1,77 @@
 #!/bin/bash
 # Copyright 2020 Huawei Technologies Co., Ltd
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 # http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ============================================================================

 echo "=============================================================================================================="
 echo "Please run the script as: "
 echo "sh run_distribute_train_gpu.sh DEVICE_NUM EPOCH_SIZE LR DATASET PRE_TRAINED PRE_TRAINED_EPOCH_SIZE"
 echo "for example: sh run_distribute_train_gpu.sh 8 500 0.2 coco /opt/ssd-300.ckpt(optional) 200(optional)"
 echo "It is better to use absolute path."
 echo "================================================================================================================="

 if [ $# != 4 ] && [ $# != 6 ]
 then
    echo "Usage: sh run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [LR] [DATASET] \
 [PRE_TRAINED](optional) [PRE_TRAINED_EPOCH_SIZE](optional)"
    exit 1
 fi

 # Before starting distributed training, create the mindrecord files first.
 BASE_PATH=$(cd "`dirname $0`" || exit; pwd)
 cd $BASE_PATH/../ || exit
 python train.py --only_create_dataset=True --run_platform="GPU"

 echo "After running the script, the network runs in the background. The log will be generated in LOG/log.txt"

 export RANK_SIZE=$1
 EPOCH_SIZE=$2
 LR=$3
 DATASET=$4
 PRE_TRAINED=$5
 PRE_TRAINED_EPOCH_SIZE=$6

 rm -rf LOG
 mkdir ./LOG
 cp ./*.py ./LOG
 cp -r ./src ./LOG
 cd ./LOG || exit

 if [ $# -eq 4 ]
 then
    mpirun -allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
    python train.py  \
    --distribute=True  \
    --lr=$LR \
    --dataset=$DATASET \
    --device_num=$RANK_SIZE  \
    --loss_scale=1 \
    --run_platform="GPU" \
    --epoch_size=$EPOCH_SIZE > log.txt 2>&1 &
 fi

 if [ $# -eq 6 ]
 then
    mpirun -allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
    python train.py  \
    --distribute=True  \
    --lr=$LR \
    --dataset=$DATASET \
    --device_num=$RANK_SIZE  \
    --pre_trained=$PRE_TRAINED \
    --pre_trained_epoch_size=$PRE_TRAINED_EPOCH_SIZE \
    --loss_scale=1 \
    --run_platform="GPU" \
    --epoch_size=$EPOCH_SIZE > log.txt 2>&1 &
 fi
--- a/model_zoo/official/cv/ssd/scripts/run_eval_gpu.sh
+++ b/model_zoo/official/cv/ssd/scripts/run_eval_gpu.sh
@@ -0,0 +1,66 @@
 #!/bin/bash
 # Copyright 2020 Huawei Technologies Co., Ltd
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 # http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 # ============================================================================

 if [ $# != 3 ]
 then
    echo "Usage: sh run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]"
 exit 1
 fi

 # Return absolute paths unchanged; resolve relative paths against $PWD.
 get_real_path(){
   case "$1" in
     /*) echo "$1" ;;
     *) realpath -m "$PWD/$1" ;;
   esac
 }

 DATASET=$1
 CHECKPOINT_PATH=$(get_real_path $2)
 echo $DATASET
 echo $CHECKPOINT_PATH

 if [ ! -f "$CHECKPOINT_PATH" ]
 then
    echo "error: CHECKPOINT_PATH=$CHECKPOINT_PATH is not a file"
    exit 1
 fi

 export DEVICE_NUM=1
 export DEVICE_ID=$3
 export RANK_SIZE=$DEVICE_NUM
 export RANK_ID=0

 BASE_PATH=$(cd "`dirname $0`" || exit; pwd)
 cd $BASE_PATH/../ || exit

 if [ -d "eval$3" ];
 then
    rm -rf ./eval$3
 fi

 mkdir ./eval$3
 cp ./*.py ./eval$3
 cp -r ./src ./eval$3
 cd ./eval$3 || exit
 env > env.log
 echo "start inferring for device $DEVICE_ID"
 python eval.py \
    --dataset=$DATASET \
    --checkpoint_path=$CHECKPOINT_PATH \
    --run_platform="GPU" \
    --device_id=$3 > log.txt 2>&1 &
 cd ..
--- a/model_zoo/official/cv/ssd/src/ssd.py
+++ b/model_zoo/official/cv/ssd/src/ssd.py
@@ -250,6 +250,8 @@ class SSD300(nn.Cell):
        pred_loc, pred_label = self.multi_box(multi_feature)
        if not self.is_training:
            pred_label = self.activation(pred_label)
        pred_loc = F.cast(pred_loc, mstype.float32)
        pred_label = F.cast(pred_label, mstype.float32)
        return pred_loc, pred_label


--- a/model_zoo/official/cv/ssd/train.py
+++ b/model_zoo/official/cv/ssd/train.py
@@ -20,12 +20,12 @@ import argparse
 import ast
 import mindspore.nn as nn
 from mindspore import context, Tensor
 from mindspore.communication.management import init
 from mindspore.communication.management import init, get_rank
 from mindspore.train.callback import CheckpointConfig, ModelCheckpoint, LossMonitor, TimeMonitor
 from mindspore.train import Model
 from mindspore.context import ParallelMode
 from mindspore.train.serialization import load_checkpoint, load_param_into_net
 from mindspore.common import set_seed
 from mindspore.common import set_seed, dtype
 from src.ssd import SSD300, SSDWithLossCell, TrainingWrapper, ssd_mobilenet_v2
 from src.config import config
 from src.dataset import create_ssd_dataset, data_to_mindrecord_byte_image, voc_data_to_mindrecord
@@ -53,20 +53,36 @@ def main():
    parser.add_argument("--loss_scale", type=int, default=1024, help="Loss scale, default is 1024.")
    parser.add_argument("--filter_weight", type=ast.literal_eval, default=False,
                        help="Filter weight parameters, default is False.")
    parser.add_argument("--run_platform", type=str, default="Ascend", choices=("Ascend", "GPU"),
                        help="run platform, only support Ascend and GPU.")
    args_opt = parser.parse_args()

    context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)

    if args_opt.distribute:
        device_num = args_opt.device_num
        context.reset_auto_parallel_context()
        context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
                                          device_num=device_num)
    if args_opt.run_platform == "Ascend":
        context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=args_opt.device_id)
        if args_opt.distribute:
            device_num = args_opt.device_num
            context.reset_auto_parallel_context()
            context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
                                              device_num=device_num)
            init()
            rank = args_opt.device_id % device_num
        else:
            rank = 0
            device_num = 1
    elif args_opt.run_platform == "GPU":
        context.set_context(mode=context.GRAPH_MODE, device_target="GPU", device_id=args_opt.device_id)
        init()
        rank = args_opt.device_id % device_num
        if args_opt.distribute:
            device_num = args_opt.device_num
            context.reset_auto_parallel_context()
            context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL, gradients_mean=True,
                                              device_num=device_num)
            rank = get_rank()
        else:
            rank = 0
            device_num = 1
    else:
        rank = 0
        device_num = 1
        raise ValueError("Unsupported platform.")

    print("Start create dataset!")

@@ -113,6 +129,8 @@ def main():

        backbone = ssd_mobilenet_v2()
        ssd = SSD300(backbone=backbone, config=config)
        if args_opt.run_platform == "GPU":
            ssd.to_float(dtype.float16)
        net = SSDWithLossCell(ssd, config)
        init_net_param(net)