# VGG16 Example

## Description

This example shows how to train and evaluate the VGG16 model on the CIFAR-10 dataset.

## Requirements

- Install [MindSpore](https://www.mindspore.cn/install/en).

- Download the CIFAR-10 binary version dataset (see the download sketch below).

> Unzip the CIFAR-10 dataset to any path you want; the folder structure should be as follows:
> ```
> .
> ├── cifar-10-batches-bin # train dataset
> └── cifar-10-verify-bin # infer dataset
> ```
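
The archive can be fetched from the official CIFAR-10 page. The commands below are one way to produce the layout above; note that `cifar-10-verify-bin` is not part of the archive, so the test batch is staged into it by hand (a sketch, not part of the provided scripts):

```
wget https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
tar -xzf cifar-10-binary.tar.gz
# The archive extracts to cifar-10-batches-bin (five training batches plus a
# test batch); copy the test batch into a separate folder for inference.
mkdir cifar-10-verify-bin
cp cifar-10-batches-bin/test_batch.bin cifar-10-verify-bin/
```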

## Running the Example

### Training

```
python train.py --data_path=your_data_path --device_id=6 > out.train.log 2>&1 &
```
The above python command runs in the background; you can view the results through the file `out.train.log`.

After training, you'll get some checkpoint files under the script folder by default.
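
Checkpoint saving is driven by MindSpore's `ModelCheckpoint` callback. A minimal sketch of the usual wiring (the prefix matches the checkpoint file name used in the evaluation example below; the step count follows the 781 steps per epoch seen in the log, and `keep_checkpoint_max` is illustrative):

```
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig

# Save a checkpoint every 781 steps (one epoch here) and keep at most
# 10 checkpoint files on disk.
config_ck = CheckpointConfig(save_checkpoint_steps=781, keep_checkpoint_max=10)
ckpt_cb = ModelCheckpoint(prefix="train_vgg_cifar10", config=config_ck)
# model.train(epoch_size, dataset, callbacks=[ckpt_cb])
```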

You will get loss values similar to the following:
```
# grep "loss is " out.train.log
epoch: 1 step: 781, loss is 2.093086
epoch: 2 step: 781, loss is 1.827582
...
```

### Evaluation

```
python eval.py --data_path=your_data_path --device_id=6 --checkpoint_path=./train_vgg_cifar10-70-781.ckpt > out.eval.log 2>&1 &
```
The above python command runs in the background; you can view the results through the file `out.eval.log`.

You will get an accuracy similar to the following:
```
# grep "result: " out.eval.log
result: {'acc': 0.92}
```
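
Restoring the trained weights before evaluation follows the standard MindSpore pattern; a sketch (here `net` stands in for the VGG16 network constructed in eval.py):

```
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# Read the parameter dictionary from the checkpoint file and load it into
# the network before evaluation runs.
param_dict = load_checkpoint("./train_vgg_cifar10-70-781.ckpt")
load_param_into_net(net, param_dict)
```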

### Distributed Training

```
sh run_distribute_train.sh rank_table.json your_data_path
```
The above shell script runs distributed training in the background; you can view the results through the file `train_parallel[X]/log`.
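
Schematically, such a launcher starts one training process per device, each in its own `train_parallel[X]` folder. A simplified sketch, assuming 8 devices (the actual run_distribute_train.sh may differ in details; prefer an absolute data path, since each process runs in its own subfolder):

```
export MINDSPORE_HCCL_CONFIG_PATH=$1
export RANK_SIZE=8
for i in 0 1 2 3 4 5 6 7; do
    export DEVICE_ID=$i
    export RANK_ID=$i
    mkdir -p train_parallel$i
    cd train_parallel$i
    # Each process trains on its own device and writes to its own log file.
    python ../train.py --data_path=$2 --device_id=$i > log 2>&1 &
    cd ..
done
```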

You will get loss values similar to the following:
```
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 97, loss is 1.9060308
train_parallel0/log:epoch: 2 step: 97, loss is 1.6003821
...
train_parallel1/log:epoch: 1 step: 97, loss is 1.7095519
train_parallel1/log:epoch: 2 step: 97, loss is 1.7133579
...
...
```
> For details about rank_table.json, refer to the [distributed training tutorial](https://www.mindspore.cn/tutorial/en/master/advanced_use/distributed_training.html).

## Usage

### Training

```
usage: train.py [--device_target TARGET] [--data_path DATA_PATH]
                [--device_id DEVICE_ID]

parameters/options:
    --device_target   the training backend type, default is Ascend.
    --data_path       the storage path of the dataset.
    --device_id       the device used to train the model.
```

### Evaluation

```
usage: eval.py [--device_target TARGET] [--data_path DATA_PATH]
               [--device_id DEVICE_ID] [--checkpoint_path CKPT_PATH]

parameters/options:
    --device_target    the evaluation backend type, default is Ascend.
    --data_path        the storage path of the dataset.
    --device_id        the device used to evaluate the model.
    --checkpoint_path  the path of the checkpoint file used to evaluate the model.
```
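
For example, assuming a GPU build of MindSpore, evaluating on a non-default backend would look like:

```
python eval.py --device_target="GPU" --data_path=your_data_path --checkpoint_path=./train_vgg_cifar10-70-781.ckpt > out.eval.log 2>&1 &
```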

### Distributed Training

```
usage: sh run_distribute_train.sh [MINDSPORE_HCCL_CONFIG_PATH] [DATA_PATH]

parameters/options:
    MINDSPORE_HCCL_CONFIG_PATH    the path of the HCCL configuration file.
    DATA_PATH                     the storage path of the dataset.
```