# Contents

- [VGG Description](#vgg-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Parameter configuration](#parameter-configuration)
    - [Training Process](#training-process)
        - [Training](#training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [VGG Description](#contents)

VGG, a family of very deep convolutional networks for large-scale image recognition, was proposed in 2014 and won 1st place in object localization and 2nd place in image classification in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

[Paper](https://arxiv.org/abs/1409.1556): Simonyan K., Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
# [Model Architecture](#contents)

The VGG16 network mainly consists of several basic modules (convolution and pooling layers) followed by three consecutive dense layers.
The basic modules are built from simple operations such as **3×3 convolution** and **2×2 max pooling**.
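For orientation, below is a minimal Python sketch of the standard VGG16 layer layout; it is illustrative only and is not the implementation in `src/vgg.py`.

```python
# Standard VGG16 layer layout (illustrative sketch, not the code in src/vgg.py).
# Numbers are the output channels of 3x3 convolutions, 'M' marks a 2x2 max-pooling layer.
vgg16_cfg = [64, 64, 'M',           # block 1
             128, 128, 'M',         # block 2
             256, 256, 256, 'M',    # block 3
             512, 512, 512, 'M',    # block 4
             512, 512, 512, 'M']    # block 5

def count_weight_layers(cfg, num_dense=3):
    """13 convolution layers from the config plus 3 dense layers = 16 weight layers."""
    return sum(1 for v in cfg if v != 'M') + num_dense

print(count_weight_layers(vgg16_cfg))  # 16
```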
# [Dataset](#contents)

Note that you can run the scripts with the dataset mentioned in the original paper or one widely used in this domain/network architecture. The following sections introduce how to run the scripts using the datasets below.

#### Dataset used: [CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html)

- CIFAR-10 dataset size: 175 MB, 60,000 32×32 color images in 10 classes
    - Train: 146 MB, 50,000 images
    - Test: 29.3 MB, 10,000 images
- Data format: binary files
- Note: Data will be processed in src/dataset.py

#### Dataset used: [ImageNet2012](http://www.image-net.org/)

- Dataset size: ~146 GB, 1.28 million color images in 1000 classes
    - Train: 140 GB, 1,281,167 images
    - Test: 6.4 GB, 50,000 images
- Data format: RGB images
- Note: Data will be processed in src/dataset.py
#### Dataset organization

CIFAR-10

> Unzip the CIFAR-10 dataset to any path you want; the folder structure should be as follows:
>
> ```
> .
> ├── cifar-10-batches-bin  # train dataset
> └── cifar-10-verify-bin   # infer dataset
> ```

ImageNet2012

> Unzip the ImageNet2012 dataset to any path you want; the folder should include the train and eval datasets as follows:
>
> ```
> .
> └─dataset
>   ├─ilsvrc                 # train dataset
>   └─validation_preprocess  # evaluate dataset
> ```
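As a quick sanity check before training, the small illustrative snippet below verifies the expected folder layout; the paths are placeholders and not taken from the scripts.

```python
import os

def check_layout(data_root, required_subdirs):
    """Raise if any expected sub-directory of the dataset is missing."""
    for name in required_subdirs:
        path = os.path.join(data_root, name)
        if not os.path.isdir(path):
            raise FileNotFoundError(f"expected directory not found: {path}")
    print("dataset layout looks OK:", data_root)

# Example paths only; point these at your own unzipped datasets.
check_layout("/path/to/cifar10", ["cifar-10-batches-bin", "cifar-10-verify-bin"])
check_layout("/path/to/dataset", ["ilsvrc", "validation_preprocess"])
```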
# [Features](#contents)

## Mixed Precision

The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates deep neural network training by using both single-precision and half-precision data formats while maintaining the accuracy achieved with single-precision training. Mixed precision speeds up computation, reduces memory usage, and makes it possible to train a larger model or batch size on specific hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend automatically handles it with reduced precision. Users can check the reduced-precision operators by enabling the INFO log and searching for 'reduce precision'.
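As a hedged illustration of how mixed precision is typically enabled in MindSpore (this is not the exact wiring in train.py, and the tiny network below is only a placeholder for VGG16):

```python
# Illustrative sketch only; train.py may configure this differently.
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.loss_scale_manager import FixedLossScaleManager

# A toy network stands in for VGG16 just to keep the example self-contained.
net = nn.SequentialCell([nn.Flatten(), nn.Dense(32 * 32 * 3, 10)])
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# amp_level="O2" casts most of the network to FP16 while keeping batch norm in FP32;
# the fixed loss scale mirrors the "loss_scale" entry in config.py.
model = Model(net, loss_fn=loss, optimizer=opt,
              amp_level="O2",
              loss_scale_manager=FixedLossScaleManager(1024, drop_overflow_update=False),
              metrics={"acc"})
```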
# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

- Running on Ascend

```bash
# run training example
python train.py --data_path=[DATA_PATH] --device_id=[DEVICE_ID] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_JSON] [DATA_PATH]

# run evaluation example
python eval.py --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```

For distributed training, an hccl configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools

- Running on GPU

```bash
# run training example
python train.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train_gpu.sh [DATA_PATH]

# run evaluation example
python eval.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```
├── model_zoo
    ├── README.md                             // descriptions about all the models
    ├── vgg16
        ├── README.md                         // descriptions about vgg16
        ├── scripts
        │   ├── run_distribute_train.sh       // shell script for distributed training on Ascend
        │   ├── run_distribute_train_gpu.sh   // shell script for distributed training on GPU
        ├── src
        │   ├── utils
        │   │   ├── logging.py                // logging format setting
        │   │   ├── sampler.py                // create sampler for dataset
        │   │   ├── util.py                   // util functions
        │   │   ├── var_init.py               // network parameter init method
        │   ├── config.py                     // parameter configuration
        │   ├── crossentropy.py               // loss calculation
        │   ├── dataset.py                    // creating dataset
        │   ├── linear_warmup.py              // linear learning rate
        │   ├── warmup_cosine_annealing_lr.py // cosine annealing learning rate
        │   ├── warmup_step_lr.py             // step or multi-step learning rate
        │   ├── vgg.py                        // vgg architecture
        ├── train.py                          // training script
        ├── eval.py                           // evaluation script
```
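The learning-rate helpers above (linear_warmup.py, warmup_step_lr.py, warmup_cosine_annealing_lr.py) all generate a per-step learning-rate list. The sketch below illustrates the warmup-plus-step idea only; it is not the exact code in those files, the example values match the CIFAR-10 config shown later in this README, and the decay factor of 0.1 is an assumption.

```python
def warmup_step_lr(lr_init, lr_max, warmup_epochs, milestones, gamma,
                   total_epochs, steps_per_epoch):
    """Linear warmup to lr_max, then decay by gamma at every milestone epoch."""
    lrs = []
    warmup_steps = warmup_epochs * steps_per_epoch
    for step in range(total_epochs * steps_per_epoch):
        epoch = step // steps_per_epoch
        if step < warmup_steps:
            lr = lr_init + (lr_max - lr_init) * (step + 1) / warmup_steps
        else:
            lr = lr_max * gamma ** sum(1 for m in milestones if epoch >= m)
        lrs.append(lr)
    return lrs

# 781 steps/epoch matches 50,000 CIFAR-10 images at batch size 64.
milestones = [int(e) for e in '30,60,90,120'.split(',')]   # from "lr_epochs"
schedule = warmup_step_lr(0.01, 0.1, 5, milestones, 0.1, 70, 781)
print(schedule[0], schedule[-1])
```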
## [Script Parameters](#contents)

### Training

```
usage: train.py [--device_target TARGET][--data_path DATA_PATH]
                [--dataset DATASET_TYPE][--is_distributed VALUE]
                [--device_id DEVICE_ID][--pre_trained PRE_TRAINED]
                [--ckpt_path CHECKPOINT_PATH][--ckpt_interval INTERVAL_STEP]

parameters/options:
  --device_target    the training backend type, Ascend or GPU, default is Ascend.
  --dataset          the dataset type, cifar10 or imagenet2012.
  --is_distributed   whether to do distributed training, value can be 0 or 1.
  --data_path        the storage path of the dataset.
  --device_id        the device used to train the model.
  --pre_trained      the pretrained checkpoint file path.
  --ckpt_path        the path to save checkpoints.
  --ckpt_interval    the epoch interval for saving checkpoints.
```

### Evaluation

```
usage: eval.py [--device_target TARGET][--data_path DATA_PATH]
               [--dataset DATASET_TYPE][--pre_trained PRE_TRAINED]
               [--device_id DEVICE_ID]

parameters/options:
  --device_target    the evaluation backend type, Ascend or GPU, default is Ascend.
  --dataset          the dataset type, cifar10 or imagenet2012.
  --data_path        the storage path of the dataset.
  --device_id        the device used to evaluate the model.
  --pre_trained      the checkpoint file path used to evaluate the model.
```
## [Parameter configuration](#contents)

Parameters for both training and evaluation can be set in config.py.

- config for vgg16, CIFAR-10 dataset

```
"num_classes": 10,                   # dataset class num
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which the lr changes
"lr_scheduler": "step",              # learning rate mode
"warmup_epochs": 5,                  # number of warmup epochs
"batch_size": 64,                    # batch size of input tensor
"max_epoch": 70,                     # only valid for training, which is always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 5e-4,                # weight decay
"loss_scale": 1.0,                   # loss scale
"label_smooth": 0,                   # label smoothing
"label_smooth_factor": 0,            # label smoothing factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'same',                  # pad mode for conv2d
"padding": 0,                        # padding value for conv2d
"has_bias": False,                   # whether conv2d has bias
"batch_norm": True,                  # whether conv2d is followed by batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "XavierUniform",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```
- config for vgg16, ImageNet2012 dataset

```
"num_classes": 1000,                 # dataset class num
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which the lr changes
"lr_scheduler": "cosine_annealing",  # learning rate mode
"warmup_epochs": 0,                  # number of warmup epochs
"batch_size": 32,                    # batch size of input tensor
"max_epoch": 150,                    # only valid for training, which is always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 1e-4,                # weight decay
"loss_scale": 1024,                  # loss scale
"label_smooth": 1,                   # label smoothing
"label_smooth_factor": 0.1,          # label smoothing factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'pad',                   # pad mode for conv2d
"padding": 1,                        # padding value for conv2d
"has_bias": True,                    # whether conv2d has bias
"batch_norm": False,                 # whether conv2d is followed by batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "KaimingNormal",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```
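The ImageNet2012 config enables label smoothing (`label_smooth: 1` with factor 0.1). Below is a small NumPy sketch of label-smoothed cross-entropy; it illustrates the idea behind src/crossentropy.py rather than reproducing its MindSpore implementation.

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, num_classes, smooth_factor=0.1):
    """Cross-entropy against softened one-hot targets."""
    # The true class gets 1 - smooth_factor; the rest is spread over the other classes.
    on_value = 1.0 - smooth_factor
    off_value = smooth_factor / (num_classes - 1)
    targets = np.full((len(labels), num_classes), off_value)
    targets[np.arange(len(labels)), labels] = on_value

    log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))
    return -np.mean(np.sum(targets * log_probs, axis=1))

logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
labels = np.array([0, 1])
print(smoothed_cross_entropy(logits, labels, num_classes=3))
```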
## [Training Process](#contents)

### Training

#### Run vgg16 on Ascend

- Training using a single device (1p), with the CIFAR-10 dataset by default

```bash
python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
```

The python command above will run in the background; you can view the results through the file `out.train.log`.

After training, you'll get some checkpoint files under the specified ckpt_path, by default in the ./output directory.

You will get the loss values as follows:

```
# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
...
```

- Distributed Training

```bash
sh run_distribute_train.sh rank_table.json your_data_path
```

The above shell script will run distributed training in the background; you can view the results through the file `train_parallel[X]/log`.

You will get the loss values as follows:

```
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
...
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
...
...
```

> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_tutorials.html).
> **Attention** This will bind the processor cores according to `device_num` and the total number of processors. If you don't expect to run pretraining with bound processor cores, remove the operations about `taskset` in `scripts/run_distribute_train.sh`.
#### Run vgg16 on GPU

- Training using a single device (1p)

```bash
python train.py --device_target="GPU" --dataset="imagenet2012" --is_distributed=0 --data_path=$DATA_PATH > output.train.log 2>&1 &
```

- Distributed Training

```bash
# distributed training (8p)
bash scripts/run_distribute_train_gpu.sh /path/ImageNet2012/train
```
## [Evaluation Process](#contents)

### Evaluation

- Run eval as follows; the dataset type needs to be specified as "cifar10" or "imagenet2012".

```bash
# when using the cifar10 dataset
python eval.py --data_path=your_data_path --dataset="cifar10" --device_target="Ascend" --pre_trained=./*-70-781.ckpt > output.eval.log 2>&1 &

# when using the imagenet2012 dataset
python eval.py --data_path=your_data_path --dataset="imagenet2012" --device_target="GPU" --pre_trained=./*-150-5004.ckpt > output.eval.log 2>&1 &
```

- The above python commands will run in the background; you can view the results through the file `output.eval.log`. You will get the accuracy as follows:

```
# when using the cifar10 dataset
# grep "result: " output.eval.log
result: {'acc': 0.92}

# when using the imagenet2012 dataset
after allreduce eval: top1_correct=36636, tot=50000, acc=73.27%
after allreduce eval: top5_correct=45582, tot=50000, acc=91.16%
```
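For reference, the top-1/top-5 numbers in the ImageNet log above can be computed from logits roughly as in the NumPy sketch below; eval.py additionally aggregates the counts across devices with an allreduce.

```python
import numpy as np

def topk_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k largest logits."""
    topk = np.argsort(logits, axis=1)[:, -k:]
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

# Random placeholder data just to show the call; real logits come from the network.
logits = np.random.randn(8, 1000)
labels = np.random.randint(0, 1000, size=8)
print("top1:", topk_accuracy(logits, labels, 1), "top5:", topk_accuracy(logits, labels, 5))
```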
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | VGG16 (Ascend)                                      | VGG16 (GPU)                                     |
| -------------------------- | --------------------------------------------------- | ----------------------------------------------- |
| Model Version              | VGG16                                               | VGG16                                           |
| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory, 755 GB | NV SMX2 V100-32G                                |
| Uploaded Date              | 10/28/2020                                          | 10/28/2020                                      |
| MindSpore Version          | 1.0.0                                               | 1.0.0                                           |
| Dataset                    | CIFAR-10                                            | ImageNet2012                                    |
| Training Parameters        | epoch=70, steps=781, batch_size=64, lr=0.1          | epoch=150, steps=40036, batch_size=32, lr=0.1   |
| Optimizer                  | Momentum                                            | Momentum                                        |
| Loss Function              | SoftmaxCrossEntropy                                 | SoftmaxCrossEntropy                             |
| Outputs                    | probability                                         | probability                                     |
| Loss                       | 0.01                                                | 1.5~2.0                                         |
| Speed                      | 1pc: 79 ms/step; 8pcs: 104 ms/step                  | 1pc: 81 ms/step; 8pcs: 94.4 ms/step             |
| Total time                 | 1pc: 72 mins; 8pcs: 11.8 mins                       | 8pcs: 19.7 hours                                |
| Checkpoint for Fine tuning | 1.1 GB (.ckpt file)                                 | 1.1 GB (.ckpt file)                             |
| Scripts                    | [vgg16](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/vgg16) |                 |
### Evaluation Performance

| Parameters        | VGG16 (Ascend)          | VGG16 (GPU)               |
| ----------------- | ----------------------- | ------------------------- |
| Model Version     | VGG16                   | VGG16                     |
| Resource          | Ascend 910              | GPU                       |
| Uploaded Date     | 10/28/2020              | 10/28/2020                |
| MindSpore Version | 1.0.0                   | 1.0.0                     |
| Dataset           | CIFAR-10, 10,000 images | ImageNet2012, 5000 images |
| batch_size        | 64                      | 32                        |
| Outputs           | probability             | probability               |
| Accuracy          | 1pc: 93.4%              | 1pc: 73.0%                |
# [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
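A minimal sketch of how the seeds might be fixed for reproducibility, assuming the standard Python/NumPy/MindSpore dataset seed APIs; the actual scripts may set seeds elsewhere:

```python
# Illustrative only; dataset.py and train.py may set seeds differently.
import random
import numpy as np
import mindspore.dataset as ds

random.seed(1)
np.random.seed(1)
ds.config.set_seed(1)   # fixes the shuffle order used when building the dataset
```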
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).