# Contents

- [VGG Description](#vgg-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Parameter configuration](#parameter-configuration)
    - [Training Process](#training-process)
        - [Training](#training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
# [VGG Description](#contents)

VGG is a family of very deep convolutional networks for large-scale image recognition. Proposed in 2014, it won first place in the object localization task and second place in the image classification task of the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

[Paper](https://arxiv.org/abs/1409.1556): Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
# [Model Architecture](#contents)

The VGG16 network mainly consists of several basic modules (convolution and pooling layers) followed by three consecutive dense layers. The basic modules are built from simple operations such as **3×3 convolution** and **2×2 max pooling**, as sketched below.
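For illustration, here is a minimal MindSpore sketch of one such basic module and the dense classifier head. This is not the repository's src/vgg.py, whose exact layer configuration (channel counts, batch norm, dropout) is driven by config.py:

```python
import mindspore.nn as nn

class VggBlock(nn.Cell):
    """One VGG basic module: two 3x3 convolutions, then a 2x2 max pooling."""
    def __init__(self, in_channels, out_channels):
        super(VggBlock, self).__init__()
        self.block = nn.SequentialCell([
            nn.Conv2d(in_channels, out_channels, kernel_size=3, pad_mode='same'),
            nn.ReLU(),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, pad_mode='same'),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        ])

    def construct(self, x):
        return self.block(x)

# The classifier head: three consecutive dense layers (sizes follow the
# original VGG16 paper for 224x224 inputs and 1000 classes).
classifier = nn.SequentialCell([
    nn.Dense(512 * 7 * 7, 4096), nn.ReLU(),
    nn.Dense(4096, 4096), nn.ReLU(),
    nn.Dense(4096, 1000),
])
```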
# [Dataset](#contents)

#### Dataset used: [CIFAR-10](<http://www.cs.toronto.edu/~kriz/cifar.html>)

- CIFAR-10 dataset size: 175 MB, 60,000 32×32 color images in 10 classes
    - Train: 146 MB, 50,000 images
    - Test: 29.3 MB, 10,000 images
- Data format: binary files
- Note: Data will be processed in src/dataset.py

#### Dataset used: [ImageNet2012](http://www.image-net.org/)

- Dataset size: ~146 GB, 1.28 million color images in 1000 classes
    - Train: 140 GB, 1,281,167 images
    - Test: 6.4 GB, 50,000 images
- Data format: RGB images
- Note: Data will be processed in src/dataset.py

#### Dataset organization

CIFAR-10

> Unzip the CIFAR-10 dataset to any path you want; the folder structure should be as follows:
>
> ```
> .
> ├── cifar-10-batches-bin  # train dataset
> └── cifar-10-verify-bin   # infer dataset
> ```

ImageNet2012

> Unzip the ImageNet2012 dataset to any path you want; the folder should contain the train and evaluation datasets as follows:
>
> ```
> .
> └─dataset
>    ├─ilsvrc                 # train dataset
>    └─validation_preprocess  # evaluate dataset
> ```
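As a rough illustration of the note above that data is processed in src/dataset.py, here is a minimal sketch of a CIFAR-10 pipeline built on MindSpore's dataset API. The exact transforms and module paths depend on the MindSpore version, so treat the function name and augmentation choices as illustrative:

```python
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as C2
import mindspore.dataset.vision.c_transforms as C

def create_cifar10_dataset(data_path, batch_size=64, training=True):
    """Build a CIFAR-10 pipeline: augment images, cast labels, batch."""
    dataset = ds.Cifar10Dataset(data_path, shuffle=training)
    transforms = []
    if training:
        # Standard CIFAR-10 augmentation: pad-and-crop plus horizontal flip.
        transforms += [C.RandomCrop((32, 32), (4, 4, 4, 4)),
                       C.RandomHorizontalFlip()]
    transforms += [C.Resize((224, 224)),         # VGG16 expects 224x224 inputs
                   C.Rescale(1.0 / 255.0, 0.0),  # scale pixels to [0, 1]
                   C.HWC2CHW()]                  # channels-first layout
    dataset = dataset.map(operations=transforms, input_columns="image")
    dataset = dataset.map(operations=C2.TypeCast(mstype.int32),
                          input_columns="label")
    return dataset.batch(batch_size, drop_remainder=True)
```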
# [Features](#contents)

## Mixed Precision

The [mixed precision](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/mixed_precision.html) training method accelerates the deep learning neural network training process by using both single-precision and half-precision data formats, while maintaining the network accuracy achieved with single-precision training. Mixed precision training accelerates computation, reduces memory usage, and enables larger models or batch sizes to be trained on specific hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically handle it with reduced precision. Users can check the reduced-precision operators by enabling the INFO log level and searching for "reduce precision".
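As a concrete illustration (a minimal sketch, not the exact code in train.py), MindSpore's high-level `Model` API can enable mixed precision through its `amp_level` argument, combined with a fixed loss scale like the `loss_scale` value in the parameter configuration below:

```python
import mindspore.nn as nn
from mindspore import context
from mindspore.train.model import Model
from mindspore.train.loss_scale_manager import FixedLossScaleManager

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")

# A toy stand-in network; the real script builds VGG16 from src/vgg.py.
net = nn.SequentialCell([nn.Conv2d(3, 16, 3), nn.ReLU(),
                         nn.Flatten(), nn.Dense(16 * 32 * 32, 10)])
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# amp_level="O2" runs most layers in FP16; the fixed loss scale guards
# against FP16 gradient underflow.
model = Model(net, loss_fn=loss, optimizer=opt, metrics={"acc"},
              amp_level="O2",
              loss_scale_manager=FixedLossScaleManager(1024, drop_overflow_update=False))
```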
# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get access to the resources.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html)
    - [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)
# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

- Running on Ascend

```bash
# run training example
python train.py --data_path=[DATA_PATH] --device_id=[DEVICE_ID] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_JSON] [DATA_PATH]

# run evaluation example
python eval.py --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```

For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools

- Running on GPU

```bash
# run training example
python train.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train_gpu.sh [DATA_PATH]

# run evaluation example
python eval.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```
# [Script Description](#contents)

## [Script and Sample Code](#contents)

```
├── model_zoo
    ├── README.md                             // descriptions about all the models
    ├── vgg16
        ├── README.md                         // descriptions about vgg16
        ├── scripts
        │   ├── run_distribute_train.sh       // shell script for distributed training on Ascend
        │   ├── run_distribute_train_gpu.sh   // shell script for distributed training on GPU
        ├── src
        │   ├── utils
        │   │   ├── logging.py                // logging format setting
        │   │   ├── sampler.py                // create sampler for dataset
        │   │   ├── util.py                   // util functions
        │   │   ├── var_init.py               // network parameter init methods
        │   ├── config.py                     // parameter configuration
        │   ├── crossentropy.py               // loss calculation
        │   ├── dataset.py                    // creating dataset
        │   ├── linear_warmup.py              // linear learning rate
        │   ├── warmup_cosine_annealing_lr.py // cosine annealing learning rate
        │   ├── warmup_step_lr.py             // step or multi-step learning rate
        │   ├── vgg.py                        // vgg architecture
        ├── train.py                          // training script
        ├── eval.py                           // evaluation script
```
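The three learning-rate files above precompute a per-step schedule with an optional warmup phase. As a rough sketch of the warmup + cosine-annealing variant (function name and defaults here are illustrative, not the repository's exact signature):

```python
import math
import numpy as np

def warmup_cosine_annealing_lr(base_lr, steps_per_epoch, warmup_epochs,
                               max_epoch, eta_min=0.0):
    """Precompute one learning rate per training step."""
    total_steps = steps_per_epoch * max_epoch
    warmup_steps = steps_per_epoch * warmup_epochs
    lr_each_step = []
    for step in range(total_steps):
        if step < warmup_steps:
            # Linear warmup from 0 up to base_lr.
            lr = base_lr * (step + 1) / warmup_steps
        else:
            # Cosine annealing from base_lr down to eta_min over max_epoch.
            cur_epoch = step // steps_per_epoch
            lr = eta_min + (base_lr - eta_min) * (
                1.0 + math.cos(math.pi * cur_epoch / max_epoch)) / 2.0
        lr_each_step.append(lr)
    return np.array(lr_each_step, dtype=np.float32)
```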
## [Script Parameters](#contents)

### Training

```
usage: train.py [--device_target TARGET][--data_path DATA_PATH]
                [--dataset DATASET_TYPE][--is_distributed VALUE]
                [--device_id DEVICE_ID][--pre_trained PRE_TRAINED]
                [--ckpt_path CHECKPOINT_PATH][--ckpt_interval INTERVAL_STEP]

parameters/options:
  --device_target    the training backend type, Ascend or GPU, default is Ascend.
  --dataset          the dataset type, cifar10 or imagenet2012.
  --is_distributed   whether to run distributed training, value can be 0 or 1.
  --data_path        the storage path of the dataset.
  --device_id        the device used to train the model.
  --pre_trained      the pretrained checkpoint file path.
  --ckpt_path        the path to save checkpoints.
  --ckpt_interval    the epoch interval for saving checkpoints.
```

### Evaluation

```
usage: eval.py [--device_target TARGET][--data_path DATA_PATH]
               [--dataset DATASET_TYPE][--pre_trained PRE_TRAINED]
               [--device_id DEVICE_ID]

parameters/options:
  --device_target    the evaluation backend type, Ascend or GPU, default is Ascend.
  --dataset          the dataset type, cifar10 or imagenet2012.
  --data_path        the storage path of the dataset.
  --device_id        the device used to evaluate the model.
  --pre_trained      the checkpoint file path used to evaluate the model.
```
## [Parameter configuration](#contents)

Parameters for both training and evaluation can be set in config.py.

- config for vgg16, CIFAR-10 dataset

```
"num_classes": 10,                   # dataset class num
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which lr changes
"lr_scheduler": "step",              # learning rate mode
"warmup_epochs": 5,                  # number of warmup epochs
"batch_size": 64,                    # batch size of input tensor
"max_epoch": 70,                     # only valid for training; always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 5e-4,                # weight decay
"loss_scale": 1.0,                   # loss scale
"label_smooth": 0,                   # label smoothing
"label_smooth_factor": 0,            # label smoothing factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'same',                  # pad mode for conv2d
"padding": 0,                        # padding value for conv2d
"has_bias": False,                   # whether conv2d has bias
"batch_norm": True,                  # whether conv2d is followed by batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "XavierUniform",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```
- config for vgg16, ImageNet2012 dataset

```
"num_classes": 1000,                 # dataset class num
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which lr changes
"lr_scheduler": "cosine_annealing",  # learning rate mode
"warmup_epochs": 0,                  # number of warmup epochs
"batch_size": 32,                    # batch size of input tensor
"max_epoch": 150,                    # only valid for training; always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 1e-4,                # weight decay
"loss_scale": 1024,                  # loss scale
"label_smooth": 1,                   # label smoothing
"label_smooth_factor": 0.1,          # label smoothing factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'pad',                   # pad mode for conv2d
"padding": 1,                        # padding value for conv2d
"has_bias": True,                    # whether conv2d has bias
"batch_norm": False,                 # whether conv2d is followed by batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "KaimingNormal",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```
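The `label_smooth` and `label_smooth_factor` settings feed the loss defined in src/crossentropy.py. As a minimal sketch of label-smoothed cross entropy (illustrative, not the repository's exact implementation):

```python
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.common import dtype as mstype
from mindspore.ops import operations as P

class CrossEntropySmooth(nn.Cell):
    """Cross entropy on one-hot labels softened by a smoothing factor."""
    def __init__(self, num_classes=1000, smooth_factor=0.1):
        super(CrossEntropySmooth, self).__init__()
        self.onehot = P.OneHot()
        # The true class gets 1 - factor; the rest share the factor evenly.
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(smooth_factor / (num_classes - 1), mstype.float32)
        self.num_classes = num_classes
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction='mean')

    def construct(self, logits, label):
        one_hot_label = self.onehot(label, self.num_classes,
                                    self.on_value, self.off_value)
        return self.ce(logits, one_hot_label)
```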
## [Training Process](#contents)

### Training

#### Run vgg16 on Ascend

- Training using a single device (1p), with the CIFAR-10 dataset by default

```
python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
```

The python command above runs in the background; you can view the results in the file `out.train.log`.

After training, you'll get some checkpoint files under the specified ckpt_path, by default in the ./output directory.

You will get loss values as follows:

```
# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
...
```

- Distributed training

```
sh run_distribute_train.sh rank_table.json your_data_path
```

The above shell script runs distributed training in the background; you can view the results in the file `train_parallel[X]/log`.

You will get loss values as follows:

```
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
...
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
...
...
```

> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
#### Run vgg16 on GPU

- Training using a single device (1p)

```
python train.py --device_target="GPU" --dataset="imagenet2012" --is_distributed=0 --data_path=$DATA_PATH > output.train.log 2>&1 &
```

- Distributed training

```
# distributed training (8p)
bash scripts/run_distribute_train_gpu.sh /path/ImageNet2012/train
```
## [Evaluation Process](#contents)

### Evaluation

- Run eval as follows. The dataset type must be specified as either "cifar10" or "imagenet2012".

```
# when using the cifar10 dataset
python eval.py --data_path=your_data_path --dataset="cifar10" --device_target="Ascend" --pre_trained=./*-70-781.ckpt > output.eval.log 2>&1 &

# when using the imagenet2012 dataset
python eval.py --data_path=your_data_path --dataset="imagenet2012" --device_target="GPU" --pre_trained=./*-150-5004.ckpt > output.eval.log 2>&1 &
```

- The above python commands run in the background; you can view the results in the file `output.eval.log`. You will get the accuracy as follows:

```
# when using the cifar10 dataset
# grep "result: " output.eval.log
result: {'acc': 0.92}

# when using the imagenet2012 dataset
after allreduce eval: top1_correct=36636, tot=50000, acc=73.27%
after allreduce eval: top5_correct=45582, tot=50000, acc=91.16%
```
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | VGG16 (Ascend)                                    | VGG16 (GPU)                                    |
| -------------------------- | ------------------------------------------------- | ---------------------------------------------- |
| Model Version              | VGG16                                             | VGG16                                          |
| Resource                   | Ascend 910; CPU 2.60 GHz, 56 cores; memory 314 GB | NV SMX2 V100-32G                               |
| Uploaded Date              | 08/20/2020                                        | 08/20/2020                                     |
| MindSpore Version          | 0.5.0-alpha                                       | 0.5.0-alpha                                    |
| Dataset                    | CIFAR-10                                          | ImageNet2012                                   |
| Training Parameters        | epoch=70, steps=781, batch_size=64, lr=0.1        | epoch=150, steps=40036, batch_size=32, lr=0.1  |
| Optimizer                  | Momentum                                          | Momentum                                       |
| Loss Function              | SoftmaxCrossEntropy                               | SoftmaxCrossEntropy                            |
| Outputs                    | probability                                       | probability                                    |
| Loss                       | 0.01                                              | 1.5~2.0                                        |
| Speed                      | 1pc: 79 ms/step; 8pcs: 104 ms/step                | 1pc: 81 ms/step; 8pcs: 94.4 ms/step            |
| Total time                 | 1pc: 72 mins; 8pcs: 11.8 mins                     | 8pcs: 19.7 hours                               |
| Checkpoint for Fine tuning | 1.1 GB (.ckpt file)                               | 1.1 GB (.ckpt file)                            |
| Scripts                    | [vgg16](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/vgg16) | |

### Evaluation Performance

| Parameters        | VGG16 (Ascend)          | VGG16 (GPU)               |
| ----------------- | ----------------------- | ------------------------- |
| Model Version     | VGG16                   | VGG16                     |
| Resource          | Ascend 910              | GPU                       |
| Uploaded Date     | 08/20/2020              | 08/20/2020                |
| MindSpore Version | 0.5.0-alpha             | 0.5.0-alpha               |
| Dataset           | CIFAR-10, 10,000 images | ImageNet2012, 5000 images |
| batch_size        | 64                      | 32                        |
| Outputs           | probability             | probability               |
| Accuracy          | 1pc: 93.4%              | 1pc: 73.0%                |
# [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
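A minimal sketch of the kind of seeding referred to above, assuming MindSpore's global and dataset seed APIs (the exact seed values and call sites live in src/dataset.py and train.py):

```python
import numpy as np
import mindspore.dataset as ds
from mindspore import set_seed

set_seed(1)            # global MindSpore seed (e.g. weight initialization)
ds.config.set_seed(1)  # dataset shuffling / augmentation seed
np.random.seed(1)      # numpy-based randomness used by the scripts
```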
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).