  1. # VGG16 Example
  2. ## Description
  3. This example is for VGG16 model training and evaluation.
  4. ## Requirements
  5. - Install [MindSpore](https://www.mindspore.cn/install/en).
  6. - Download the dataset CIFAR-10 or ImageNet2012.
  7. CIFAR-10
  8. > Unzip the CIFAR-10 dataset to any path you want and the folder structure should be as follows:
  9. > ```
  10. > .
  11. > ├── cifar-10-batches-bin # train dataset
  12. > └── cifar-10-verify-bin # infer dataset
  13. > ```
  14. ImageNet2012
  15. > Unzip the ImageNet2012 dataset to any path you want; the folder should contain the training and evaluation datasets as follows:
  16. >
  17. > ```
  18. > .
  19. > └─dataset
  20. >   ├─ilsvrc                # train dataset
  21. >   └─validation_preprocess # evaluate dataset
  22. > ```
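Before launching a run, it can help to confirm that the unzipped dataset matches one of the layouts above. The following is a minimal sketch and not part of this repository: the function name `check_dataset_layout` is hypothetical, and it assumes that `ilsvrc` and `validation_preprocess` sit directly under `dataset`.

```python
from pathlib import Path

# Expected sub-folders, copied from the directory trees above.
# Assumption: ilsvrc and validation_preprocess are children of "dataset".
EXPECTED_DIRS = {
    "cifar10": ["cifar-10-batches-bin", "cifar-10-verify-bin"],
    "imagenet2012": ["dataset/ilsvrc", "dataset/validation_preprocess"],
}


def check_dataset_layout(root, dataset):
    """Return the expected sub-folders that are missing under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS[dataset] if not (root / d).is_dir()]
```

An empty list means every expected folder was found; otherwise the list names what is missing.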
  23. ## Parameter configuration
  24. Parameters for both training and evaluation can be set in config.py.
  25. - config for vgg16, CIFAR-10 dataset
  26. ```
  27. "num_classes": 10, # dataset class num
  28. "lr": 0.01, # learning rate
  29. "lr_init": 0.01, # initial learning rate
  30. "lr_max": 0.1, # max learning rate
  31. "lr_epochs": '30,60,90,120', # epochs at which the learning rate changes
  32. "lr_scheduler": "step", # learning rate mode
  33. "warmup_epochs": 5, # number of warmup epoch
  34. "batch_size": 64, # batch size of input tensor
  35. "max_epoch": 70, # only valid for training, which is always 1 for inference
  36. "momentum": 0.9, # momentum
  37. "weight_decay": 5e-4, # weight decay
  38. "loss_scale": 1.0, # loss scale
  39. "label_smooth": 0, # label smooth
  40. "label_smooth_factor": 0, # label smooth factor
  41. "buffer_size": 10, # shuffle buffer size
  42. "image_size": '224,224', # image size
  43. "pad_mode": 'same', # pad mode for conv2d
  44. "padding": 0, # padding value for conv2d
  45. "has_bias": False, # whether has bias in conv2d
  46. "batch_norm": True, # whether has batch_norm in conv2d
  47. "keep_checkpoint_max": 10, # only keep the last keep_checkpoint_max checkpoint
  48. "initialize_mode": "XavierUniform", # conv2d init mode
  49. "has_dropout": True # whether using Dropout layer
  50. ```
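As an illustration of how `lr_scheduler: "step"`, `lr_epochs` and `warmup_epochs` might interact, here is a minimal per-epoch sketch. It is not the schedule implemented in this repository: the decay factor `gamma=0.1`, the linear warmup shape, and the function name `step_lr` are all assumptions, since the config above does not specify them.

```python
def step_lr(epoch, base_lr, milestones, gamma=0.1,
            warmup_epochs=0, warmup_init_lr=0.0):
    """Learning rate for a 0-based epoch under step decay with linear warmup.

    gamma and the warmup shape are assumptions, not values from config.py.
    """
    if epoch < warmup_epochs:
        # Ramp linearly from warmup_init_lr up to base_lr.
        frac = (epoch + 1) / warmup_epochs
        return warmup_init_lr + (base_lr - warmup_init_lr) * frac
    # Multiply by gamma once for every milestone already reached.
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed


# "lr_epochs" is stored as a comma-separated string in the config.
milestones = [int(e) for e in "30,60,90,120".split(",")]
schedule = [step_lr(e, 0.01, milestones) for e in range(70)]
```

With `max_epoch` set to 70, only the milestones at epochs 30 and 60 are reached in this configuration.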
  51. - config for vgg16, ImageNet2012 dataset
  52. ```
  53. "num_classes": 1000, # dataset class num
  54. "lr": 0.01, # learning rate
  55. "lr_init": 0.01, # initial learning rate
  56. "lr_max": 0.1, # max learning rate
  57. "lr_epochs": '30,60,90,120', # epochs at which the learning rate changes
  58. "lr_scheduler": "cosine_annealing", # learning rate mode
  59. "warmup_epochs": 0, # number of warmup epoch
  60. "batch_size": 32, # batch size of input tensor
  61. "max_epoch": 150, # only valid for training, which is always 1 for inference
  62. "momentum": 0.9, # momentum
  63. "weight_decay": 1e-4, # weight decay
  64. "loss_scale": 1024, # loss scale
  65. "label_smooth": 1, # label smooth
  66. "label_smooth_factor": 0.1, # label smooth factor
  67. "buffer_size": 10, # shuffle buffer size
  68. "image_size": '224,224', # image size
  69. "pad_mode": 'pad', # pad mode for conv2d
  70. "padding": 1, # padding value for conv2d
  71. "has_bias": True, # whether has bias in conv2d
  72. "batch_norm": False, # whether has batch_norm in conv2d
  73. "keep_checkpoint_max": 10, # only keep the last keep_checkpoint_max checkpoint
  74. "initialize_mode": "KaimingNormal", # conv2d init mode
  75. "has_dropout": True # whether using Dropout layer
  76. ```
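Two settings in the ImageNet2012 config differ in kind from the CIFAR-10 one: `lr_scheduler: "cosine_annealing"` and `label_smooth_factor: 0.1`. The sketch below shows the textbook form of both ideas. It is an illustration only: the function names are hypothetical, the minimum learning rate of 0 is an assumption, and the repository may combine `lr`, `lr_init` and `lr_max` differently or use another label-smoothing variant.

```python
import math


def cosine_annealing_lr(epoch, max_epoch, lr_max, lr_min=0.0):
    """Cosine decay from lr_max (epoch 0) to lr_min (epoch == max_epoch)."""
    cos_term = 1 + math.cos(math.pi * epoch / max_epoch)
    return lr_min + 0.5 * (lr_max - lr_min) * cos_term


def smooth_labels(label, num_classes, factor):
    """One-hot target mixed with a uniform distribution.

    Assumed variant: (1 - factor) * one_hot + factor / num_classes.
    """
    off_value = factor / num_classes
    on_value = 1.0 - factor + off_value
    return [on_value if i == label else off_value for i in range(num_classes)]
```

With `num_classes` 1000 and a factor of 0.1, the true class receives a target of about 0.9001 and every other class about 0.0001, so the targets still sum to 1.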
  77. ## Running the Example
  78. ### Training
  79. **Run vgg16, using CIFAR-10 dataset**
  80. - Training on a single device (1p)
  81. ```
  82. python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
  83. ```
  84. The command above runs in the background; you can follow its progress in the file `out.train.log`.
  85. After training, checkpoint files are saved in the specified `ckpt_path` (default: the `./output` directory).
  86. The loss values look like the following:
  87. ```
  88. # grep "loss is " out.train.log
  89. epoch: 1 step: 781, loss is 2.093086
  90. epoch: 2 step: 781, loss is 1.827582
  91. ...
  92. ```
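If you want the loss values as numbers rather than grep output, a small parser is enough. This is a sketch that assumes log lines of the form `epoch: N step: M, loss is X` as shown above; `parse_losses` is a hypothetical helper, not part of the repository.

```python
import re

# Matches lines such as "epoch: 1 step: 781, loss is 2.093086".
LOSS_LINE = re.compile(
    r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([0-9.eE+-]+)"
)


def parse_losses(lines):
    """Return (epoch, step, loss) tuples for every matching line."""
    records = []
    for line in lines:
        match = LOSS_LINE.search(line)
        if match:
            epoch, step, loss = match.groups()
            records.append((int(epoch), int(step), float(loss)))
    return records
```

Because the pattern is searched for anywhere in the line, the same helper also works on grep output that carries a file-name prefix. To use it on a real log, pass an open file object, for example `parse_losses(open("out.train.log"))`.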
  93. - Distributed training
  94. ```
  95. sh run_distribute_train.sh rank_table.json your_data_path
  96. ```
  97. The shell script above runs distributed training in the background; you can view the results in the files `train_parallel[X]/log`.
  98. The loss values look like the following:
  99. ```
  100. # grep "loss is " train_parallel*/log
  101. train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
  102. train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
  103. ...
  104. train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
  105. train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
  106. ...
  107. ...
  108. ```
  109. > For details on rank_table.json, see the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
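The step counts in the logs above (781 on a single device, 97 per device in the distributed run) are consistent with the 50,000 training images of CIFAR-10 and a `batch_size` of 64. The check below rests on two assumptions that this README does not state: the distributed run uses 8 devices, and the last incomplete batch of each epoch is dropped. `steps_per_epoch` is a hypothetical helper.

```python
def steps_per_epoch(num_samples, batch_size, num_devices=1):
    """Full batches per device per epoch, dropping the remainder."""
    return num_samples // (batch_size * num_devices)


CIFAR10_TRAIN_IMAGES = 50000

single_device_steps = steps_per_epoch(CIFAR10_TRAIN_IMAGES, 64)
# Assumption: 8 devices in the distributed run.
distributed_steps = steps_per_epoch(CIFAR10_TRAIN_IMAGES, 64, num_devices=8)
```

Under these assumptions the two values come out to 781 and 97, matching the logs.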
  110. **Run vgg16, using ImageNet2012 dataset**
  111. - Training on a single device (1p)
  112. ```
  113. python train.py --device_target="GPU" --dataset="imagenet2012" --is_distributed=0 --data_path=$DATA_PATH > output.train.log 2>&1 &
  114. ```
  115. - Distributed training
  116. ```
  117. # distributed training(8p)
  118. bash scripts/run_distribute_train_gpu.sh /path/ImageNet2012/train
  119. ```
  120. ### Evaluation
  121. - Run evaluation as follows, specifying the dataset type as "cifar10" or "imagenet2012":
  122. ```
  123. # when using cifar10 dataset
  124. python eval.py --data_path=your_data_path --dataset="cifar10" --device_target="Ascend" --pre_trained=./*-70-781.ckpt > out.eval.log 2>&1 &
  125. # when using imagenet2012 dataset
  126. python eval.py --data_path=your_data_path --dataset="imagenet2012" --device_target="GPU" --pre_trained=./*-150-5004.ckpt > out.eval.log 2>&1 &
  127. ```
  129. The commands above run in the background; you can view the results in the file `out.eval.log`.
  130. The reported accuracy looks like the following, depending on the dataset used:
  131. ```
  132. # when using cifar10 dataset
  133. # grep "result: " out.eval.log
  134. result: {'acc': 0.92}
  135. # when using the imagenet2012 dataset
  136. after allreduce eval: top1_correct=36636, tot=50000, acc=73.27%
  137. after allreduce eval: top5_correct=45582, tot=50000, acc=91.16%
  138. ```
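The ImageNet2012 lines report raw counts alongside the percentage, so the accuracy can be recomputed as `correct / tot`. The following sketch parses one such line and recomputes the value; the regular expression is written against the sample output above, and `parse_eval_line` is a hypothetical helper rather than part of eval.py.

```python
import re

# Matches e.g. "top1_correct=36636, tot=50000, acc=73.27%".
EVAL_LINE = re.compile(r"top(\d+)_correct=(\d+), tot=(\d+), acc=([\d.]+)%")


def parse_eval_line(line):
    """Extract counts from an eval log line and recompute the accuracy."""
    match = EVAL_LINE.search(line)
    if match is None:
        return None
    k, correct, total = (int(match.group(i)) for i in (1, 2, 3))
    return {
        "k": k,
        "correct": correct,
        "total": total,
        "reported": float(match.group(4)),
        # Percentage rounded to two decimals, like the log output.
        "recomputed": round(100.0 * correct / total, 2),
    }
```

For the two sample lines, the recomputed values agree with the reported 73.27% (top-1) and 91.16% (top-5).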
  139. ## Usage
  140. ### Training
  141. ```
  142. usage: train.py [--device_target TARGET][--data_path DATA_PATH]
  143. [--dataset DATASET_TYPE][--is_distributed VALUE]
  144. [--device_id DEVICE_ID][--pre_trained PRE_TRAINED]
  145. [--ckpt_path CHECKPOINT_PATH][--ckpt_interval INTERVAL_STEP]
  146. parameters/options:
  147. --device_target the training backend type, Ascend or GPU, default is Ascend.
  148. --dataset the dataset type, cifar10 or imagenet2012.
  149. --is_distributed whether to run distributed training; value can be 0 or 1.
  150. --data_path the storage path of dataset.
  151. --device_id the device used to train the model.
  152. --pre_trained the pretrained checkpoint file path.
  153. --ckpt_path the path to save checkpoint.
  154. --ckpt_interval the epoch interval for saving checkpoint.
  155. ```
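For readers who want to see how these options could map onto a command-line parser, here is a minimal `argparse` sketch. It is not the actual parser in train.py. Only the `Ascend` default for `--device_target` and the `./output` default for `--ckpt_path` come from this README; the `cifar10` default is inferred from the CIFAR-10 example, which omits `--dataset`, and every other default and type is an assumption.

```python
import argparse


def build_train_parser():
    """Hypothetical parser mirroring the options listed above."""
    parser = argparse.ArgumentParser(prog="train.py")
    parser.add_argument("--device_target", default="Ascend",
                        choices=["Ascend", "GPU"])
    # Default inferred from the CIFAR-10 example, not documented.
    parser.add_argument("--dataset", default="cifar10",
                        choices=["cifar10", "imagenet2012"])
    # The remaining defaults and types are assumptions for illustration.
    parser.add_argument("--is_distributed", type=int, default=0,
                        choices=[0, 1])
    parser.add_argument("--data_path", default="")
    parser.add_argument("--device_id", type=int, default=0)
    parser.add_argument("--pre_trained", default="")
    parser.add_argument("--ckpt_path", default="./output")
    parser.add_argument("--ckpt_interval", type=int, default=1)
    return parser


args = build_train_parser().parse_args(
    ["--data_path=your_data_path", "--device_id=6"]
)
```

Parsing the same flags as the single-device CIFAR-10 command leaves `args.device_target` at `"Ascend"` and sets `args.device_id` to `6`.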
  156. ### Evaluation
  157. ```
  158. usage: eval.py [--device_target TARGET][--data_path DATA_PATH]
  159. [--dataset DATASET_TYPE][--pre_trained PRE_TRAINED]
  160. [--device_id DEVICE_ID]
  161. parameters/options:
  162. --device_target the evaluation backend type, Ascend or GPU, default is Ascend.
  163. --dataset the dataset type, cifar10 or imagenet2012.
  164. --data_path the storage path of dataset.
  165. --device_id the device used to evaluate the model.
  166. --pre_trained the path of the checkpoint file to evaluate.
  167. ```
  168. ### Distributed Training
  169. - Train on Ascend.
  170. ```
  171. Usage: sh scripts/run_distribute_train.sh [RANK_TABLE_FILE] [DATA_PATH]
  172. parameters/options:
  173. RANK_TABLE_FILE HCCL configuration file path.
  174. DATA_PATH the storage path of dataset.
  175. ```
  176. - Train on GPU.
  177. ```
  178. Usage: bash run_distribute_train_gpu.sh [DATA_PATH]
  179. parameters/options:
  180. DATA_PATH the storage path of dataset.
  181. ```