# VGG16 Example

## Description

This example trains and evaluates a VGG16 model on the CIFAR-10 dataset.

## Requirements

- Install [MindSpore](https://www.mindspore.cn/install/en).
- Download the CIFAR-10 binary version dataset.

> Unzip the CIFAR-10 dataset to a path of your choice; the folder structure should look like this:
> ```
> .
> ├── cifar-10-batches-bin # train dataset
> └── cifar-10-verify-bin  # infer dataset
> ```
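For reference, here is a minimal sketch of how the extracted folders are typically consumed through MindSpore's dataset API. The helper name and batch size are illustrative, not part of the example scripts:

```
import mindspore.dataset as ds

def create_cifar10_dataset(data_path, batch_size=64, shuffle=True):
    """Build a batched CIFAR-10 dataset from the binary files under data_path."""
    cifar_ds = ds.Cifar10Dataset(data_path, shuffle=shuffle)
    # train.py/eval.py add image transforms here; omitted to keep the sketch short.
    return cifar_ds.batch(batch_size, drop_remainder=True)

# e.g. create_cifar10_dataset("./cifar-10-batches-bin") for training data,
#      create_cifar10_dataset("./cifar-10-verify-bin", shuffle=False) for inference.
```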
## Running the Example

### Training

```
python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
```

The Python command above runs in the background; you can view the results in the file `out.train.log`.

After training, you'll get some checkpoint files under the script folder by default (see the sketch after the log excerpt below).

You will get the loss values as follows:

```
# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
...
```
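The checkpoint files come from MindSpore's standard checkpoint callback. A hedged sketch of how train.py typically configures it; the step count and file prefix are assumptions chosen to match the evaluation command below:

```
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor

# Save a checkpoint at the end of each epoch (781 steps per CIFAR-10 epoch here)
# and keep at most 10 files; the prefix matches the file name used for evaluation.
ckpt_config = CheckpointConfig(save_checkpoint_steps=781, keep_checkpoint_max=10)
ckpt_cb = ModelCheckpoint(prefix="train_vgg_cifar10", config=ckpt_config)

# model.train(epoch_size, train_dataset, callbacks=[ckpt_cb, LossMonitor()])
```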
### Evaluation

```
python eval.py --data_path=your_data_path --device_id=6 --checkpoint_path=./train_vgg_cifar10-70-781.ckpt > out.eval.log 2>&1 &
```

The Python command above runs in the background; you can view the results in the file `out.eval.log`.

You will get the accuracy as follows:

```
# grep "result: " out.eval.log
result: {'acc': 0.92}
```
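Under the hood, `--checkpoint_path` is used to restore the trained weights before computing accuracy. A hedged sketch of that step; the `vgg16` constructor import and the evaluation dataset are assumptions based on the example's layout:

```
from mindspore import Model, nn
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from vgg import vgg16  # network definition shipped with this example (path assumed)

net = vgg16(num_classes=10)
param_dict = load_checkpoint("./train_vgg_cifar10-70-781.ckpt")
load_param_into_net(net, param_dict)  # copy the trained weights into the network

loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True)
model = Model(net, loss_fn=loss, metrics={"acc"})
# result = model.eval(eval_dataset); print("result:", result)
```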
### Distributed Training

```
sh run_distribute_train.sh rank_table.json your_data_path
```

The shell script above runs distributed training in the background; you can view the results in the files `train_parallel[X]/log`.

You will get the loss values as follows:

```
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
...
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
...
...
```

> For details about `rank_table.json`, refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).
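Each process launched by the script initializes the HCCL communication backend and enables data-parallel training before building the model. A minimal sketch of that setup; exact import paths and context flags may differ between MindSpore versions:

```
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
# init() sets up HCCL; the launcher script passes the rank table path to each
# process via the MINDSPORE_HCCL_CONFIG_PATH environment variable.
init()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  device_num=get_group_size())
```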
## Usage

### Training

```
usage: train.py [--device_target TARGET] [--data_path DATA_PATH]
                [--device_id DEVICE_ID]

parameters/options:
  --device_target   the training backend type, default is Ascend.
  --data_path       the storage path of the dataset.
  --device_id       the device used to train the model.
```

### Evaluation

```
usage: eval.py [--device_target TARGET] [--data_path DATA_PATH]
               [--device_id DEVICE_ID] [--checkpoint_path CKPT_PATH]

parameters/options:
  --device_target    the evaluation backend type, default is Ascend.
  --data_path        the storage path of the dataset.
  --device_id        the device used to evaluate the model.
  --checkpoint_path  the checkpoint file path used to evaluate the model.
```

### Distributed Training

```
usage: sh run_distribute_train.sh [MINDSPORE_HCCL_CONFIG_PATH] [DATA_PATH]

parameters/options:
  MINDSPORE_HCCL_CONFIG_PATH  the HCCL configuration file path.
  DATA_PATH                   the storage path of the dataset.
```