# Contents

- [VGG Description](#vgg-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Parameter configuration](#parameter-configuration)
    - [Training Process](#training-process)
        - [Training](#training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
## [VGG Description](#contents)

VGG, a very deep convolutional network for large-scale image recognition, was proposed in 2014 and won first place in the object localization task and second place in the image classification task of the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14).

[Paper](https://arxiv.org/abs/1409.1556): Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
## [Model Architecture](#contents)

The VGG16 network mainly consists of several basic modules (convolution and pooling layers) followed by three consecutive dense (fully connected) layers. The basic modules are built from simple operations such as **3×3 convolution** and **2×2 max pooling**.
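
For orientation, the sketch below shows one such basic module in MindSpore's `nn` API. The channel layout follows the standard VGG16 design and is only illustrative; it is not the exact code in `src/vgg.py`.

```python
import mindspore.nn as nn

def make_vgg_block(in_channels, out_channels, num_convs, batch_norm=True):
    """One VGG basic module: repeated 3x3 convolutions followed by 2x2 max pooling.
    Illustrative sketch only, not the implementation in src/vgg.py."""
    layers = []
    for _ in range(num_convs):
        layers.append(nn.Conv2d(in_channels, out_channels, kernel_size=3, pad_mode='same'))
        if batch_norm:
            layers.append(nn.BatchNorm2d(out_channels))
        layers.append(nn.ReLU())
        in_channels = out_channels
    layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
    return nn.SequentialCell(layers)

# The five convolutional stages of VGG16 (64, 128, 256, 512, 512 channels);
# three dense layers (4096, 4096, num_classes) follow these in the full network.
features = nn.SequentialCell([
    make_vgg_block(3, 64, 2),
    make_vgg_block(64, 128, 2),
    make_vgg_block(128, 256, 3),
    make_vgg_block(256, 512, 3),
    make_vgg_block(512, 512, 3),
])
```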
## [Dataset](#contents)

Note that you can run the scripts with the dataset mentioned in the original paper or one widely used in this domain/for this network architecture. The following sections describe how to run the scripts with the datasets below.

### Dataset used: [CIFAR-10](<http://www.cs.toronto.edu/~kriz/cifar.html>)

- CIFAR-10 dataset size: 175 MB, 60,000 32×32 color images in 10 classes
    - Train: 146 MB, 50,000 images
    - Test: 29.3 MB, 10,000 images
- Data format: binary files
- Note: Data will be processed in src/dataset.py

### Dataset used: [ImageNet2012](http://www.image-net.org/)

- Dataset size: ~146 GB, 1.28 million color images in 1000 classes
    - Train: 140 GB, 1,281,167 images
    - Test: 6.4 GB, 50,000 images
- Data format: RGB images
- Note: Data will be processed in src/dataset.py
#### Dataset organization

CIFAR-10

> Unzip the CIFAR-10 dataset to any path you want; the folder structure should be as follows:
>
> ```bash
> .
> ├── cifar-10-batches-bin  # train dataset
> └── cifar-10-verify-bin   # infer dataset
> ```

ImageNet2012

> Unzip the ImageNet2012 dataset to any path you want; the folder should include the train and eval datasets as follows:
>
> ```bash
> .
> └─dataset
>    ├─ilsvrc                 # train dataset
>    └─validation_preprocess  # evaluate dataset
> ```
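
For reference, the sketch below shows how a CIFAR-10 folder like the one above is typically loaded with MindSpore's dataset API. The transform list and parameter values are assumptions for illustration; the real pipeline, including augmentation, lives in `src/dataset.py`.

```python
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as vision

def create_cifar10_dataset(data_path, batch_size=64, training=True):
    """Illustrative CIFAR-10 pipeline; see src/dataset.py for the actual one."""
    cifar = ds.Cifar10Dataset(data_path, shuffle=training)
    transforms = [
        vision.Resize((224, 224)),         # VGG16 expects 224x224 inputs
        vision.Rescale(1.0 / 255.0, 0.0),  # scale pixel values to [0, 1]
        vision.HWC2CHW(),                  # channel-first layout for MindSpore
    ]
    cifar = cifar.map(operations=transforms, input_columns="image")
    return cifar.batch(batch_size, drop_remainder=training)
```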
## [Features](#contents)

### Mixed Precision

The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both single-precision and half-precision data formats, while maintaining the network precision achieved with single-precision training. Mixed precision training can accelerate the computation, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend will automatically handle it with reduced precision. Users can check the reduced-precision operators by enabling INFO logging and searching for "reduce precision".
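
As an illustration only, mixed precision in MindSpore is usually enabled through the `amp_level` argument of `Model`. The stand-in network, optimizer settings, and loss-scale value below are assumptions; the training scripts in this repository configure this internally.

```python
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.loss_scale_manager import FixedLossScaleManager

# Stand-in network for this sketch; the real model is the VGG16 net from src/vgg.py.
net = nn.SequentialCell([nn.Flatten(), nn.Dense(3 * 32 * 32, 10)])
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.01, momentum=0.9)

# "O2" keeps BatchNorm in FP32 and casts the rest of the network to FP16.
model = Model(net, loss_fn=loss, optimizer=opt,
              amp_level="O2",
              loss_scale_manager=FixedLossScaleManager(1024, drop_overflow_update=False))
```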
## [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with an Ascend or GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
## [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

- Running on Ascend

```bash
# run training example
python train.py --data_path=[DATA_PATH] --device_id=[DEVICE_ID] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train.sh [RANK_TABLE_JSON] [DATA_PATH]

# run evaluation example
python eval.py --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```

For distributed training, an hccl configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
<https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools>

- Running on GPU

```bash
# run training example
python train.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] > output.train.log 2>&1 &

# run distributed training example
sh run_distribute_train_gpu.sh [DATA_PATH]

# run evaluation example
python eval.py --device_target="GPU" --device_id=[DEVICE_ID] --dataset=[DATASET_TYPE] --data_path=[DATA_PATH] --pre_trained=[PRE_TRAINED] > output.eval.log 2>&1 &
```
## [Script Description](#contents)

### [Script and Sample Code](#contents)

```bash
├── model_zoo
    ├── README.md                             // descriptions about all the models
    ├── vgg16
        ├── README.md                         // descriptions about vgg16
        ├── scripts
        │   ├── run_distribute_train.sh       // shell script for distributed training on Ascend
        │   ├── run_distribute_train_gpu.sh   // shell script for distributed training on GPU
        ├── src
        │   ├── utils
        │   │   ├── logging.py                // logging format setting
        │   │   ├── sampler.py                // create sampler for dataset
        │   │   ├── util.py                   // util functions
        │   │   ├── var_init.py               // network parameter init method
        │   ├── config.py                     // parameter configuration
        │   ├── crossentropy.py               // loss calculation
        │   ├── dataset.py                    // creating dataset
        │   ├── linear_warmup.py              // linear learning rate
        │   ├── warmup_cosine_annealing_lr.py // cosine annealing learning rate
        │   ├── warmup_step_lr.py             // step or multi-step learning rate
        │   ├── vgg.py                        // vgg architecture
        ├── train.py                          // training script
        ├── eval.py                           // evaluation script
```
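
As background for the learning-rate scripts listed above, the sketch below outlines the general warmup-then-cosine-annealing schedule that `warmup_cosine_annealing_lr.py` implements. The function name and arguments here are illustrative assumptions, not the exact code.

```python
import math

def warmup_cosine_annealing_lr(base_lr, steps_per_epoch, warmup_epochs, max_epoch, eta_min=0.0):
    """Return one learning-rate value per training step:
    linear warmup for warmup_epochs, then cosine decay from base_lr to eta_min."""
    total_steps = steps_per_epoch * max_epoch
    warmup_steps = steps_per_epoch * warmup_epochs
    lr_each_step = []
    for step in range(total_steps):
        if step < warmup_steps:
            lr = base_lr * (step + 1) / warmup_steps
        else:
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            lr = eta_min + (base_lr - eta_min) * 0.5 * (1 + math.cos(math.pi * progress))
        lr_each_step.append(lr)
    return lr_each_step
```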
### [Script Parameters](#contents)

#### Training

```bash
usage: train.py [--device_target TARGET][--data_path DATA_PATH]
                [--dataset DATASET_TYPE][--is_distributed VALUE]
                [--device_id DEVICE_ID][--pre_trained PRE_TRAINED]
                [--ckpt_path CHECKPOINT_PATH][--ckpt_interval INTERVAL_STEP]

parameters/options:
  --device_target       the training backend type, Ascend or GPU, default is Ascend.
  --dataset             the dataset type, cifar10 or imagenet2012.
  --is_distributed      whether to run distributed training, value can be 0 or 1.
  --data_path           the storage path of the dataset.
  --device_id           the device used to train the model.
  --pre_trained         the pretrained checkpoint file path.
  --ckpt_path           the path to save checkpoints.
  --ckpt_interval       the epoch interval for saving checkpoints.
```
#### Evaluation

```bash
usage: eval.py [--device_target TARGET][--data_path DATA_PATH]
               [--dataset DATASET_TYPE][--pre_trained PRE_TRAINED]
               [--device_id DEVICE_ID]

parameters/options:
  --device_target       the evaluation backend type, Ascend or GPU, default is Ascend.
  --dataset             the dataset type, cifar10 or imagenet2012.
  --data_path           the storage path of the dataset.
  --device_id           the device used to evaluate the model.
  --pre_trained         the checkpoint file path used to evaluate the model.
```
### [Parameter configuration](#contents)

Parameters for both training and evaluation can be set in config.py.

- config for vgg16, CIFAR-10 dataset

```bash
"num_classes": 10,                   # dataset class num
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which lr changes
"lr_scheduler": "step",              # learning rate mode
"warmup_epochs": 5,                  # number of warmup epochs
"batch_size": 64,                    # batch size of input tensor
"max_epoch": 70,                     # only valid for training, which is always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 5e-4,                # weight decay
"loss_scale": 1.0,                   # loss scale
"label_smooth": 0,                   # label smooth
"label_smooth_factor": 0,            # label smooth factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'same',                  # pad mode for conv2d
"padding": 0,                        # padding value for conv2d
"has_bias": False,                   # whether conv2d has a bias
"batch_norm": True,                  # whether conv2d is followed by batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "XavierUniform",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```
- config for vgg16, ImageNet2012 dataset

```bash
"num_classes": 1000,                 # dataset class num
"lr": 0.01,                          # learning rate
"lr_init": 0.01,                     # initial learning rate
"lr_max": 0.1,                       # max learning rate
"lr_epochs": '30,60,90,120',         # epochs at which lr changes
"lr_scheduler": "cosine_annealing",  # learning rate mode
"warmup_epochs": 0,                  # number of warmup epochs
"batch_size": 32,                    # batch size of input tensor
"max_epoch": 150,                    # only valid for training, which is always 1 for inference
"momentum": 0.9,                     # momentum
"weight_decay": 1e-4,                # weight decay
"loss_scale": 1024,                  # loss scale
"label_smooth": 1,                   # label smooth
"label_smooth_factor": 0.1,          # label smooth factor
"buffer_size": 10,                   # shuffle buffer size
"image_size": '224,224',             # image size
"pad_mode": 'pad',                   # pad mode for conv2d
"padding": 1,                        # padding value for conv2d
"has_bias": True,                    # whether conv2d has a bias
"batch_norm": False,                 # whether conv2d is followed by batch_norm
"keep_checkpoint_max": 10,           # only keep the last keep_checkpoint_max checkpoints
"initialize_mode": "KaimingNormal",  # conv2d init mode
"has_dropout": True                  # whether to use Dropout layers
```
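
The ImageNet2012 configuration enables label smoothing (`label_smooth: 1` with factor 0.1). The sketch below shows what a label-smoothed softmax cross entropy, as computed in `src/crossentropy.py`, generally looks like in MindSpore; the class name and exact structure are assumptions for illustration.

```python
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.common import dtype as mstype
from mindspore.ops import operations as P

class CrossEntropySmooth(nn.Cell):
    """Softmax cross entropy with label smoothing (illustrative sketch)."""
    def __init__(self, num_classes=1000, smooth_factor=0.1):
        super(CrossEntropySmooth, self).__init__()
        self.onehot = P.OneHot()
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        # Spread smooth_factor evenly over the remaining classes.
        self.off_value = Tensor(smooth_factor / (num_classes - 1), mstype.float32)
        self.num_classes = num_classes
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction="mean")

    def construct(self, logits, label):
        soft_label = self.onehot(label, self.num_classes, self.on_value, self.off_value)
        return self.ce(logits, soft_label)
```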
### [Training Process](#contents)

#### Training

##### Run vgg16 on Ascend

- Training on a single device (1p), using the CIFAR-10 dataset by default

```bash
python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
```

The python command above runs in the background; you can view the results through the file `out.train.log`.

After training, you will find checkpoint files under the specified ckpt_path, by default in the ./output directory.

You will get loss values like the following:

```bash
# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
...
```

- Distributed Training

```bash
sh run_distribute_train.sh rank_table.json your_data_path
```

The above shell script runs distributed training in the background; you can view the results through the file `train_parallel[X]/log`.

You will get loss values like the following:

```bash
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
...
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
...
...
```

> About rank_table.json, you can refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_tutorials.html).
>
> **Attention** This binds the processor cores according to `device_num` and the total number of processor cores. If you don't want to run pretraining with bound processor cores, remove the `taskset` operations in `scripts/run_distribute_train.sh`.
##### Run vgg16 on GPU

- Training on a single device (1p)

```bash
python train.py --device_target="GPU" --dataset="imagenet2012" --is_distributed=0 --data_path=$DATA_PATH > output.train.log 2>&1 &
```

- Distributed Training

```bash
# distributed training (8p)
bash scripts/run_distribute_train_gpu.sh /path/ImageNet2012/train
```
### [Evaluation Process](#contents)

#### Evaluation

- Run evaluation as follows; the dataset type must be specified as "cifar10" or "imagenet2012".

```bash
# when using the cifar10 dataset
python eval.py --data_path=your_data_path --dataset="cifar10" --device_target="Ascend" --pre_trained=./*-70-781.ckpt > output.eval.log 2>&1 &

# when using the imagenet2012 dataset
python eval.py --data_path=your_data_path --dataset="imagenet2012" --device_target="GPU" --pre_trained=./*-150-5004.ckpt > output.eval.log 2>&1 &
```

- The above python commands run in the background; you can view the results through the file `output.eval.log`. You will get accuracy results like the following:

```bash
# when using the cifar10 dataset
# grep "result: " output.eval.log
result: {'acc': 0.92}

# when using the imagenet2012 dataset
after allreduce eval: top1_correct=36636, tot=50000, acc=73.27%
after allreduce eval: top5_correct=45582, tot=50000, acc=91.16%
```
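
The ImageNet2012 lines above report top-1 and top-5 accuracy as correct/total counts aggregated over all devices. As a reference only, such counts can be computed per batch as in the numpy sketch below; this is not the code in eval.py.

```python
import numpy as np

def topk_correct(logits, labels, k=5):
    """Count how many samples have the true label among the k highest-scoring classes."""
    # indices of the k largest logits per sample, shape (batch, k)
    topk = np.argsort(logits, axis=1)[:, -k:]
    return int(np.sum(np.any(topk == labels[:, None], axis=1)))

# Dividing the accumulated counts by the total gives the reported accuracies,
# e.g. 36636 / 50000 = 73.27% (top-1) and 45582 / 50000 = 91.16% (top-5).
```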
## [Model Description](#contents)

### [Performance](#contents)

#### Training Performance

| Parameters                 | VGG16 (Ascend)                                      | VGG16 (GPU)                                    |
| -------------------------- | --------------------------------------------------- | ---------------------------------------------- |
| Model Version              | VGG16                                                | VGG16                                          |
| Resource                   | Ascend 910; CPU 2.60 GHz, 192 cores; memory, 755 GB  | NV SMX2 V100-32G                               |
| Uploaded Date              | 10/28/2020                                           | 10/28/2020                                     |
| MindSpore Version          | 1.0.0                                                | 1.0.0                                          |
| Dataset                    | CIFAR-10                                             | ImageNet2012                                   |
| Training Parameters        | epoch=70, steps=781, batch_size=64, lr=0.1           | epoch=150, steps=40036, batch_size=32, lr=0.1  |
| Optimizer                  | Momentum                                             | Momentum                                       |
| Loss Function              | SoftmaxCrossEntropy                                  | SoftmaxCrossEntropy                            |
| Outputs                    | probability                                          | probability                                    |
| Loss                       | 0.01                                                 | 1.5~2.0                                        |
| Speed                      | 1pc: 79 ms/step; 8pcs: 104 ms/step                   | 1pc: 81 ms/step; 8pcs: 94.4 ms/step            |
| Total time                 | 1pc: 72 mins; 8pcs: 11.8 mins                        | 8pcs: 19.7 hours                               |
| Checkpoint for Fine tuning | 1.1 GB (.ckpt file)                                  | 1.1 GB (.ckpt file)                            |
| Scripts                    | [vgg16](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/vgg16) | |
#### Evaluation Performance

| Parameters        | VGG16 (Ascend)           | VGG16 (GPU)               |
| ----------------- | ------------------------ | ------------------------- |
| Model Version     | VGG16                    | VGG16                     |
| Resource          | Ascend 910               | GPU                       |
| Uploaded Date     | 10/28/2020               | 10/28/2020                |
| MindSpore Version | 1.0.0                    | 1.0.0                     |
| Dataset           | CIFAR-10, 10,000 images  | ImageNet2012, 5000 images |
| batch_size        | 64                       | 32                        |
| Outputs           | probability              | probability               |
| Accuracy          | 1pc: 93.4%               | 1pc: 73.0%                |
## [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the "create_dataset" function. We also use a random seed in train.py.
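
For fully reproducible runs, the global seeds can also be fixed before building the dataset and network; a minimal sketch, assuming the standard MindSpore seed APIs:

```python
import mindspore.dataset as ds
from mindspore.common import set_seed

set_seed(1)            # fixes weight initialization and other MindSpore-level randomness
ds.config.set_seed(1)  # fixes shuffle/augmentation randomness in the dataset pipeline
```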
## [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).