# Contents

- [VGG Description](#vgg-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Parameter configuration](#parameter-configuration)
    - [Training Process](#training-process)
        - [Training](#training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [VGG Description](#contents)

VGG, a family of very deep convolutional networks for large-scale image recognition, was proposed in 2014 and won first place in the object localization task and second place in the image classification task of the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

[Paper](https://arxiv.org/abs/1409.1556): Simonyan K., Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. arXiv preprint arXiv:1409.1556, 2014.

# [Model Architecture](#contents)

The VGG16 network mainly consists of several basic modules (convolution and pooling layers) followed by three consecutive Dense layers. The basic modules are built from two operations: **3×3 convolution** and **2×2 max pooling**.
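
To make the block structure concrete, here is a minimal sketch of this layout in MindSpore. It is illustrative only: the real network lives in src/vgg.py, and argument names such as `keep_prob` for `nn.Dropout` follow older MindSpore releases, so they may need adjusting for your installed version.

```python
import mindspore.nn as nn

def make_block(in_ch, out_ch, num_convs):
    """One basic module: num_convs 3x3 convolutions, then 2x2 max pooling."""
    layers = []
    for _ in range(num_convs):
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, pad_mode='same'), nn.ReLU()]
        in_ch = out_ch
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.SequentialCell(layers)

class VGG16Sketch(nn.Cell):
    def __init__(self, num_classes=1000):
        super().__init__()
        # (in_channels, out_channels, conv count) for the five basic modules
        cfg = [(3, 64, 2), (64, 128, 2), (128, 256, 3), (256, 512, 3), (512, 512, 3)]
        self.features = nn.SequentialCell([make_block(i, o, n) for i, o, n in cfg])
        self.flatten = nn.Flatten()
        # Three consecutive Dense layers; 512 * 7 * 7 assumes 224x224 inputs.
        self.classifier = nn.SequentialCell([
            nn.Dense(512 * 7 * 7, 4096), nn.ReLU(), nn.Dropout(keep_prob=0.5),
            nn.Dense(4096, 4096), nn.ReLU(), nn.Dropout(keep_prob=0.5),
            nn.Dense(4096, num_classes),
        ])

    def construct(self, x):
        return self.classifier(self.flatten(self.features(x)))
```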

# [Dataset](#contents)

#### Dataset used: [CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html)

- CIFAR-10 dataset size: 175 MB, 60,000 32×32 color images in 10 classes
    - Train: 146 MB, 50,000 images
    - Test: 29.3 MB, 10,000 images
- Data format: binary files
- Note: Data will be processed in src/dataset.py (a loading sketch follows the layout below).

#### Dataset used: [ImageNet2012](http://www.image-net.org/)

- Dataset size: ~146 GB, 1.28 million color images in 1,000 classes
    - Train: 140 GB, 1,281,167 images
    - Test: 6.4 GB, 50,000 images
- Data format: RGB images
- Note: Data will be processed in src/dataset.py.

#### Dataset organization

CIFAR-10

> Unzip the CIFAR-10 dataset to any path you want; the folder structure should be as follows:
>
> ```
> .
> ├── cifar-10-batches-bin  # train dataset
> └── cifar-10-verify-bin   # infer dataset
> ```

ImageNet2012

> Unzip the ImageNet2012 dataset to any path you want; the folder should include the train and eval datasets as follows:
>
> ```
> .
> └─dataset
>    ├─ilsvrc                 # train dataset
>    └─validation_preprocess  # evaluate dataset
> ```

# [Features](#contents)

## Mixed Precision

The [mixed precision](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/mixed_precision.html) training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, while maintaining the network precision achieved by single-precision training. Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically handle it with reduced precision. Users can check the reduced-precision operators by enabling the INFO log and then searching for 'reduce precision'.
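
For reference, a minimal sketch of how mixed precision is typically enabled through the high-level `Model` API. The `net`, `loss_fn`, and `opt` arguments are placeholders for what train.py builds, and import paths for `FixedLossScaleManager` differ between MindSpore versions.

```python
from mindspore import Model, FixedLossScaleManager

def build_amp_model(net, loss_fn, opt, loss_scale=1024):
    """Wrap a network for mixed-precision training.

    amp_level="O2" casts most operators to FP16 while keeping numerically
    sensitive parts in FP32; loss_scale mirrors the "loss_scale" entry in
    the parameter configuration below.
    """
    manager = FixedLossScaleManager(loss_scale, drop_overflow_update=False)
    return Model(net, loss_fn=loss_fn, optimizer=opt,
                 loss_scale_manager=manager,
                 amp_level="O2", metrics={"acc"})

# usage in a training script would resemble:
#   model = build_amp_model(net, loss, opt)
#   model.train(max_epoch, train_set)
```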

# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html)
    - [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)

# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

- Running on Ascend

```bash
# run training example
python train.py --data_path=[DATA_PATH] --device_id=[DEVICE_ID] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_JSON] [DATA_PATH]

# run evaluation example
python eval.py --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```

For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools

- Running on GPU

```bash
# run training example
python train.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train_gpu.sh [DATA_PATH]

# run evaluation example
python eval.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```
├── model_zoo
    ├── README.md                             // descriptions about all the models
    ├── vgg16
        ├── README.md                         // descriptions about vgg16
        ├── scripts
        │   ├── run_distribute_train.sh       // shell script for distributed training on Ascend
        │   ├── run_distribute_train_gpu.sh   // shell script for distributed training on GPU
        ├── src
        │   ├── utils
        │   │   ├── logging.py                // logging format setting
        │   │   ├── sampler.py                // create sampler for dataset
        │   │   ├── util.py                   // util function
        │   │   ├── var_init.py               // network parameter init method
        │   ├── config.py                     // parameter configuration
        │   ├── crossentropy.py               // loss calculation
        │   ├── dataset.py                    // creating dataset
        │   ├── linear_warmup.py              // linear learning rate
        │   ├── warmup_cosine_annealing_lr.py // cosine annealing learning rate
        │   ├── warmup_step_lr.py             // step or multi-step learning rate
        │   ├── vgg.py                        // vgg architecture
        ├── train.py                          // training script
        ├── eval.py                           // evaluation script
```
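
The three learning-rate scripts above generate per-step schedules that the optimizer consumes as a flat list. As a rough sketch of the warmup-plus-cosine variant (the actual implementation in src/warmup_cosine_annealing_lr.py may differ in details):

```python
import math

def warmup_cosine_annealing_lr(lr_max, steps_per_epoch, warmup_epochs,
                               max_epoch, eta_min=0.0):
    """Per-step schedule: linear warmup, then cosine annealing to eta_min."""
    total_steps = steps_per_epoch * max_epoch
    warmup_steps = steps_per_epoch * warmup_epochs
    lr_each_step = []
    for step in range(total_steps):
        if warmup_steps > 0 and step < warmup_steps:
            lr = lr_max * (step + 1) / warmup_steps  # linear ramp to lr_max
        else:
            progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
            lr = eta_min + (lr_max - eta_min) * 0.5 * (1 + math.cos(math.pi * progress))
        lr_each_step.append(lr)
    return lr_each_step

# illustrative numbers only, e.g. 70 epochs of 781 steps with 5 warmup epochs
schedule = warmup_cosine_annealing_lr(0.1, steps_per_epoch=781,
                                      warmup_epochs=5, max_epoch=70)
```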

## [Script Parameters](#contents)

### Training

```
usage: train.py [--device_target TARGET][--data_path DATA_PATH]
                [--dataset DATASET_TYPE][--is_distributed VALUE]
                [--device_id DEVICE_ID][--pre_trained PRE_TRAINED]
                [--ckpt_path CHECKPOINT_PATH][--ckpt_interval INTERVAL_STEP]

parameters/options:
  --device_target       the training backend type, Ascend or GPU, default is Ascend.
  --dataset             the dataset type, cifar10 or imagenet2012.
  --is_distributed      whether to run distributed training, value can be 0 or 1.
  --data_path           the storage path of the dataset.
  --device_id           the device used to train the model.
  --pre_trained         the pretrained checkpoint file path.
  --ckpt_path           the path to save checkpoints.
  --ckpt_interval       the epoch interval for saving checkpoints.
```

### Evaluation

```
usage: eval.py [--device_target TARGET][--data_path DATA_PATH]
               [--dataset DATASET_TYPE][--pre_trained PRE_TRAINED]
               [--device_id DEVICE_ID]

parameters/options:
  --device_target       the evaluation backend type, Ascend or GPU, default is Ascend.
  --dataset             the dataset type, cifar10 or imagenet2012.
  --data_path           the storage path of the dataset.
  --device_id           the device used to evaluate the model.
  --pre_trained         the checkpoint file path used to evaluate the model.
```

## [Parameter configuration](#contents)

Parameters for both training and evaluation can be set in config.py.

- config for vgg16, CIFAR-10 dataset

```
"num_classes": 10,                   # dataset class count
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which lr changes
"lr_scheduler": "step",              # learning rate mode
"warmup_epochs": 5,                  # number of warmup epochs
"batch_size": 64,                    # batch size of input tensor
"max_epoch": 70,                     # only valid for training; always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 5e-4,                # weight decay
"loss_scale": 1.0,                   # loss scale
"label_smooth": 0,                   # label smoothing
"label_smooth_factor": 0,            # label smoothing factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'same',                  # pad mode for conv2d
"padding": 0,                        # padding value for conv2d
"has_bias": False,                   # whether conv2d has a bias
"batch_norm": True,                  # whether conv2d uses batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "XavierUniform",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```

- config for vgg16, ImageNet2012 dataset

```
"num_classes": 1000,                 # dataset class count
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which lr changes
"lr_scheduler": "cosine_annealing",  # learning rate mode
"warmup_epochs": 0,                  # number of warmup epochs
"batch_size": 32,                    # batch size of input tensor
"max_epoch": 150,                    # only valid for training; always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 1e-4,                # weight decay
"loss_scale": 1024,                  # loss scale
"label_smooth": 1,                   # label smoothing
"label_smooth_factor": 0.1,          # label smoothing factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'pad',                   # pad mode for conv2d
"padding": 1,                        # padding value for conv2d
"has_bias": True,                    # whether conv2d has a bias
"batch_norm": False,                 # whether conv2d uses batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "KaimingNormal",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```
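
In the MindSpore model zoo such settings are commonly exposed as an `EasyDict` so they can be read with attribute access. A minimal sketch of what config.py might look like; the `easydict` dependency and the variable name `cifar_cfg` are assumptions, not a guaranteed match for the actual file:

```python
from easydict import EasyDict as edict

# Hypothetical config object mirroring the CIFAR-10 table above.
cifar_cfg = edict({
    "num_classes": 10,
    "lr": 0.01,
    "lr_scheduler": "step",
    "warmup_epochs": 5,
    "batch_size": 64,
    "max_epoch": 70,
    "momentum": 0.9,
    "weight_decay": 5e-4,
    # ... remaining keys follow the table above
})

print(cifar_cfg.batch_size)  # attribute access, prints 64
```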

## [Training Process](#contents)

### Training

#### Run vgg16 on Ascend

- Training using a single device (1p), with the CIFAR-10 dataset by default

```bash
python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
```

The python command above runs in the background; you can view the results through the file `out.train.log`.

After training, you'll get some checkpoint files under the specified ckpt_path, the ./output directory by default.

You will get the loss values as follows:

```bash
# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
...
```

- Distributed training

```bash
sh run_distribute_train.sh rank_table.json your_data_path
```

The above shell script runs distributed training in the background; you can view the results through the file `train_parallel[X]/log`.

You will get the loss values as follows:

```bash
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
...
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
...
...
```

> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).

#### Run vgg16 on GPU

- Training using a single device (1p)

```bash
python train.py --device_target="GPU" --dataset="imagenet2012" --is_distributed=0 --data_path=$DATA_PATH > output.train.log 2>&1 &
```

- Distributed training

```bash
# distributed training (8p)
bash scripts/run_distribute_train_gpu.sh /path/ImageNet2012/train
```

## [Evaluation Process](#contents)

### Evaluation

- Run the evaluation as follows. The dataset type must be specified as "cifar10" or "imagenet2012".

```bash
# when using the cifar10 dataset
python eval.py --data_path=your_data_path --dataset="cifar10" --device_target="Ascend" --pre_trained=./*-70-781.ckpt > output.eval.log 2>&1 &

# when using the imagenet2012 dataset
python eval.py --data_path=your_data_path --dataset="imagenet2012" --device_target="GPU" --pre_trained=./*-150-5004.ckpt > output.eval.log 2>&1 &
```

- The above python command runs in the background; you can view the results through the file `output.eval.log`. You will get the accuracy as follows:

```bash
# when using the cifar10 dataset
# grep "result: " output.eval.log
result: {'acc': 0.92}

# when using the imagenet2012 dataset
after allreduce eval: top1_correct=36636, tot=50000, acc=73.27%
after allreduce eval: top5_correct=45582, tot=50000, acc=91.16%
```

# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | VGG16 (Ascend)                                    | VGG16 (GPU)                                    |
| -------------------------- | ------------------------------------------------- | ---------------------------------------------- |
| Model Version              | VGG16                                             | VGG16                                          |
| Resource                   | Ascend 910; CPU 2.60 GHz, 56 cores; memory 314 GB | NV SMX2 V100-32G                               |
| Uploaded Date              | 08/20/2020                                        | 08/20/2020                                     |
| MindSpore Version          | 0.5.0-alpha                                       | 0.5.0-alpha                                    |
| Dataset                    | CIFAR-10                                          | ImageNet2012                                   |
| Training Parameters        | epoch=70, steps=781, batch_size=64, lr=0.1        | epoch=150, steps=40036, batch_size=32, lr=0.1  |
| Optimizer                  | Momentum                                          | Momentum                                       |
| Loss Function              | SoftmaxCrossEntropy                               | SoftmaxCrossEntropy                            |
| Outputs                    | probability                                       | probability                                    |
| Loss                       | 0.01                                              | 1.5~2.0                                        |
| Speed                      | 1pc: 79 ms/step; 8pcs: 104 ms/step                | 1pc: 81 ms/step; 8pcs: 94.4 ms/step            |
| Total time                 | 1pc: 72 mins; 8pcs: 11.8 mins                     | 8pcs: 19.7 hours                               |
| Checkpoint for Fine tuning | 1.1 GB (.ckpt file)                               | 1.1 GB (.ckpt file)                            |
| Scripts                    | [vgg16](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/vgg16) | |

### Evaluation Performance

| Parameters        | VGG16 (Ascend)          | VGG16 (GPU)               |
| ----------------- | ----------------------- | ------------------------- |
| Model Version     | VGG16                   | VGG16                     |
| Resource          | Ascend 910              | GPU                       |
| Uploaded Date     | 08/20/2020              | 08/20/2020                |
| MindSpore Version | 0.5.0-alpha             | 0.5.0-alpha               |
| Dataset           | CIFAR-10, 10,000 images | ImageNet2012, 5000 images |
| batch_size        | 64                      | 32                        |
| Outputs           | probability             | probability               |
| Accuracy          | 1pc: 93.4%              | 1pc: 73.0%                |

# [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the `create_dataset` function. We also use a random seed in train.py.
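
A minimal sketch of what fixing the random situation typically looks like; the seed value is illustrative, and in older MindSpore releases `set_seed` lives under `mindspore.common` rather than the top-level package:

```python
import random

import numpy as np
from mindspore import set_seed

# Fix Python-, NumPy-, and MindSpore-level randomness so dataset shuffling
# and weight initialization are reproducible across runs.
random.seed(1)
np.random.seed(1)
set_seed(1)
```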

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).