  1. # Contents
  2. - [Contents](#contents)
  3. - [Xception Description](#xception-description)
  4. - [Model architecture](#model-architecture)
  5. - [Dataset](#dataset)
  6. - [Features](#features)
  7. - [Mixed Precision](#mixed-precision)
  8. - [Environment Requirements](#environment-requirements)
  9. - [Script description](#script-description)
  10. - [Script and sample code](#script-and-sample-code)
  11. - [Script Parameters](#script-parameters)
  12. - [Training process](#training-process)
  13. - [Usage](#usage)
  14. - [Launch](#launch)
  15. - [Result](#result)
  16. - [Eval process](#eval-process)
  17. - [Usage](#usage-1)
  18. - [Launch](#launch-1)
  19. - [Result](#result-1)
  20. - [Model description](#model-description)
  21. - [Performance](#performance)
  22. - [Training Performance](#training-performance)
  23. - [Inference Performance](#inference-performance)
  24. - [Description of Random Situation](#description-of-random-situation)
  25. - [ModelZoo Homepage](#modelzoo-homepage)
  26. # [Xception Description](#contents)
  27. Xception, from Google, is an "extreme" version of Inception. Using a modified depthwise separable convolution, it outperforms Inception-v3 in the paper's experiments. The paper was published in 2017.
  28. [Paper](https://arxiv.org/pdf/1610.02357v3.pdf) François Chollet. Xception: Deep Learning with Depthwise Separable Convolutions. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
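To make the appeal of depthwise separable convolutions concrete, the sketch below compares weight counts for a standard convolution and a separable one. The layer size (3 x 3 kernel, 728 input and output channels) is only an illustrative choice, and biases are ignored. The count does not depend on whether the depthwise or the pointwise step comes first.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights in a standard k x k convolution (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def separable_conv_params(kernel_size, in_channels, out_channels):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = kernel_size * kernel_size * in_channels  # one k x k filter per channel
    pointwise = in_channels * out_channels               # 1 x 1 channel mixing
    return depthwise + pointwise


# Illustrative layer size, chosen for this example only.
standard = conv_params(3, 728, 728)
separable = separable_conv_params(3, 728, 728)
print(standard, separable, round(standard / separable, 2))
```

For this layer size the separable variant needs roughly one ninth of the weights, which is the usual argument for the factorization.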
  29. # [Model architecture](#contents)
  30. The overall network architecture of Xception is shown in the paper:
  31. [Link](https://arxiv.org/pdf/1610.02357v3.pdf)
  32. # [Dataset](#contents)
  33. The dataset used is the one described in the paper.
  34. - Dataset size: 125G, 1250k color images in 1000 classes
  35. - Train: 120G, 1200k images
  36. - Test: 5G, 50k images
  37. - Data format: RGB images.
  38. - Note: Data will be processed in src/dataset.py
  39. # [Features](#contents)
  40. ## [Mixed Precision](#contents)
  41. The [mixed precision](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/enable_mixed_precision.html) training method accelerates the deep learning neural network training process by using both the single-precision and half-precision data formats, and maintains the network precision achieved by the single-precision training at the same time. Mixed precision training can accelerate the computation process, reduce memory usage, and enable a larger model or batch size to be trained on specific hardware.
  42. For FP16 operators, if the input data type is FP32, the MindSpore backend automatically handles it with reduced precision. Users can check which operators run with reduced precision by enabling INFO logging and searching for 'reduce precision'.
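The configuration later in this README sets `loss_scale` to 1024. As a rough illustration of why mixed precision training uses loss scaling, the sketch below rounds a small value to half precision with and without scaling. It uses only the Python standard library and is not MindSpore's implementation; the gradient value `1e-8` is a hypothetical example.

```python
import struct


def to_fp16(value):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", value))[0]


loss_scale = 1024   # same value as 'loss_scale' in the config
small_grad = 1e-8   # hypothetical gradient, too small for FP16

# Stored directly in FP16, the value underflows to zero.
unscaled = to_fp16(small_grad)

# Scaled before the FP16 cast and unscaled afterwards, it survives.
recovered = to_fp16(small_grad * loss_scale) / loss_scale

print(unscaled, recovered)
```

Scaling the loss multiplies every gradient by the same factor, so dividing by that factor after the half-precision step restores the original magnitude.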
  43. # [Environment Requirements](#contents)
  44. - Hardware (Ascend/GPU)
  45. - Prepare a hardware environment with an Ascend or GPU processor.
  46. - Framework
  47. - [MindSpore](https://www.mindspore.cn/install/en)
  48. - For more information, please check the resources below:
  49. - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
  50. - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
  51. # [Script description](#contents)
  52. ## [Script and sample code](#contents)
  53. ```shell
  54. .
  55. └─Xception
  56.   ├─README.md
  57.   ├─scripts
  58.     ├─run_standalone_train.sh # launch standalone training on the Ascend platform (1p)
  59.     ├─run_distribute_train.sh # launch distributed training on the Ascend platform (8p)
  60.     ├─run_train_gpu_fp32.sh # launch standalone or distributed fp32 training on the GPU platform (1p or 8p)
  61.     ├─run_train_gpu_fp16.sh # launch standalone or distributed fp16 training on the GPU platform (1p or 8p)
  62.     ├─run_eval.sh # launch evaluation on the Ascend platform
  63.     └─run_eval_gpu.sh # launch evaluation on the GPU platform
  64.   ├─src
  65.     ├─config.py # parameter configuration
  66.     ├─dataset.py # data preprocessing
  67.     ├─Xception.py # network definition
  68.     ├─loss.py # customized CrossEntropy loss function
  69.     └─lr_generator.py # learning rate generator
  70.   ├─train.py # train net
  71.   ├─export.py # export net
  72.   └─eval.py # eval net
  73. ```
  74. ## [Script Parameters](#contents)
  75. Parameters for both training and evaluation can be set in config.py.
  76. - Config on ascend
  77. ```python
  78. # Major parameters in train.py and config.py:
  79. 'num_classes': 1000 # number of dataset classes
  80. 'batch_size': 128 # input batch size
  81. 'loss_scale': 1024 # loss scale
  82. 'momentum': 0.9 # momentum
  83. 'weight_decay': 1e-4 # weight decay
  84. 'epoch_size': 250 # total number of epochs
  85. 'save_checkpoint': True # whether to save checkpoints
  86. 'save_checkpoint_epochs': 1 # save a checkpoint every N epochs
  87. 'keep_checkpoint_max': 5 # maximum number of checkpoints to keep
  88. 'save_checkpoint_path': "./" # checkpoint save path
  89. 'warmup_epochs': 1 # number of warmup epochs
  90. 'lr_decay_mode': "liner" # lr decay mode
  91. 'use_label_smooth': True # whether to use label smoothing
  92. 'finish_epoch': 0 # number of epochs already finished
  93. 'label_smooth_factor': 0.1 # label smoothing factor
  94. 'lr_init': 0.00004 # initial learning rate
  95. 'lr_max': 0.4 # maximum learning rate
  96. 'lr_end': 0.00004 # final (minimum) learning rate
  97. ```
  98. - Config on gpu
  99. ```python
  100. # Major parameters in train.py and config.py:
  101. 'num_classes': 1000 # number of dataset classes
  102. 'batch_size': 64 # input batch size
  103. 'loss_scale': 1024 # loss scale
  104. 'momentum': 0.9 # momentum
  105. 'weight_decay': 1e-4 # weight decay
  106. 'epoch_size': 250 # total number of epochs
  107. 'save_checkpoint': True # whether to save checkpoints
  108. 'save_checkpoint_epochs': 1 # save a checkpoint every N epochs
  109. 'keep_checkpoint_max': 5 # maximum number of checkpoints to keep
  110. 'save_checkpoint_path': "./gpu-ckpt" # checkpoint save path
  111. 'warmup_epochs': 1 # number of warmup epochs
  112. 'lr_decay_mode': "linear" # lr decay mode
  113. 'use_label_smooth': True # whether to use label smoothing
  114. 'finish_epoch': 0 # number of epochs already finished
  115. 'label_smooth_factor': 0.1 # label smoothing factor
  116. 'lr_init': 0.00004 # initial learning rate
  117. 'lr_max': 0.4 # maximum learning rate
  118. 'lr_end': 0.00004 # final (minimum) learning rate
  119. ```
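The `lr_init`, `lr_max`, `lr_end`, `warmup_epochs` and `lr_decay_mode` entries suggest a warmup phase followed by a linear decay. The sketch below shows one plausible shape for such a schedule. It is an assumption for illustration, not the contents of `src/lr_generator.py`; the function name `warmup_linear_lr` and the small `steps_per_epoch` value are hypothetical.

```python
def warmup_linear_lr(lr_init, lr_max, lr_end, warmup_steps, total_steps):
    """Per-step learning rates: linear warmup, then linear decay (assumed shape)."""
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            # Ramp up from lr_init towards lr_max.
            lr = lr_init + (lr_max - lr_init) * step / warmup_steps
        else:
            # Decay from lr_max towards lr_end over the remaining steps.
            progress = (step - warmup_steps) / (total_steps - warmup_steps)
            lr = lr_max - (lr_max - lr_end) * progress
        lrs.append(lr)
    return lrs


steps_per_epoch = 10  # hypothetical, kept small so the list is easy to inspect
lrs = warmup_linear_lr(
    lr_init=0.00004,
    lr_max=0.4,
    lr_end=0.00004,
    warmup_steps=1 * steps_per_epoch,    # 'warmup_epochs': 1
    total_steps=250 * steps_per_epoch,   # 'epoch_size': 250
)
print(lrs[0], max(lrs), lrs[-1])
```

The peak is reached at the end of the warmup epoch, and the last step ends slightly above `lr_end` because the decay reaches `lr_end` only one step after the final one.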
  120. ## [Training process](#contents)
  121. ### Usage
  122. You can start training using python or shell scripts. The shell scripts are used as follows:
  123. - Ascend:
  124. ```shell
  125. # distributed training example (8p)
  126. sh scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
  127. # standalone training
  128. sh scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
  129. ```
  130. - GPU:
  131. ```shell
  132. # fp32 distributed training example(8p)
  133. sh scripts/run_train_gpu_fp32.sh DEVICE_NUM DATASET_PATH PRETRAINED_CKPT_PATH(optional)
  134. # fp32 standalone training example
  135. sh scripts/run_train_gpu_fp32.sh 1 DATASET_PATH PRETRAINED_CKPT_PATH(optional)
  136. # fp16 distributed training example(8p)
  137. sh scripts/run_train_gpu_fp16.sh DEVICE_NUM DATASET_PATH PRETRAINED_CKPT_PATH(optional)
  138. # fp16 standalone training example
  139. sh scripts/run_train_gpu_fp16.sh 1 DATASET_PATH PRETRAINED_CKPT_PATH(optional)
  140. # evaluation example
  141. sh scripts/run_eval_gpu.sh DEVICE_ID DATASET_PATH CHECKPOINT_PATH
  142. ```
  143. > Notes: For RANK_TABLE_FILE, see [Link](https://www.mindspore.cn/tutorial/training/en/master/advanced_use/distributed_training_ascend.html). The device_ip can be obtained as described in [Link](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
  144. ### Launch
  145. ``` shell
  146. # training example
  147. # python:
  148. # Ascend:
  149. python train.py --device_target Ascend --dataset_path /dataset/train
  150. # GPU:
  151. python train.py --device_target GPU --dataset_path /dataset/train
  152. # shell:
  153. # Ascend:
  154. # distributed training example (8p)
  155. sh scripts/run_distribute_train.sh RANK_TABLE_FILE DATA_PATH
  156. # standalone training
  157. sh scripts/run_standalone_train.sh DEVICE_ID DATA_PATH
  158. # GPU:
  159. # fp16 training example (8p)
  160. sh scripts/run_train_gpu_fp16.sh DEVICE_NUM DATA_PATH
  161. # fp32 training example (8p)
  162. sh scripts/run_train_gpu_fp32.sh DEVICE_NUM DATA_PATH
  163. ```
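The `python train.py` commands above pass `--device_target` and `--dataset_path`. A minimal sketch of how such flags could be parsed is shown below. Only the two flag names come from this README; the default, the `choices` restriction and the helper name `build_parser` are assumptions, and the real `train.py` may define more options.

```python
import argparse


def build_parser():
    """Parser for the two flags shown in the launch commands (sketch only)."""
    parser = argparse.ArgumentParser(description="Xception training (sketch)")
    # The allowed values and the default are assumptions for this example.
    parser.add_argument("--device_target", choices=["Ascend", "GPU"], default="Ascend")
    parser.add_argument("--dataset_path", required=True)
    return parser


# Parse the same arguments as the GPU launch command above.
args = build_parser().parse_args(
    ["--device_target", "GPU", "--dataset_path", "/dataset/train"]
)
print(args.device_target, args.dataset_path)
```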
  164. ### Result
  165. Training results are stored in the example path. By default, checkpoints are stored in `./ckpt_0` for Ascend and `./gpu_ckpt` for GPU, and the training log is redirected to `log.txt` for Ascend and `log_gpu.txt` for GPU, as follows.
  166. - Ascend:
  167. ``` shell
  168. epoch: 1 step: 1251, loss is 4.8427444
  169. epoch time: 701242.350 ms, per step time: 560.545 ms
  170. epoch: 2 step: 1251, loss is 4.0637593
  171. epoch time: 598591.422 ms, per step time: 478.490 ms
  172. ```
  173. - GPU:
  174. ``` shell
  175. epoch: 1 step: 20018, loss is 5.479554
  176. epoch time: 5664051.330 ms, per step time: 282.948 ms
  177. epoch: 2 step: 20018, loss is 5.179064
  178. epoch time: 5628609.779 ms, per step time: 281.177 ms
  179. ```
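The log lines above can be cross-checked with a few lines of Python. The sketch below parses the second Ascend epoch, recomputes the per-step time, and extrapolates to the 250 epochs set by `epoch_size`. The helper name `parse_epoch` is hypothetical. The last check additionally assumes the ImageNet-1k training split (1,281,167 images), 8 devices, a per-device batch size of 128 and dropped remainders; none of these is stated explicitly in this README.

```python
import re

# Second epoch of the Ascend log shown above.
ASCEND_LOG = """\
epoch: 2 step: 1251, loss is 4.0637593
epoch time: 598591.422 ms, per step time: 478.490 ms
"""


def parse_epoch(log_text):
    """Extract the step count and the epoch time in ms from one log entry."""
    steps = int(re.search(r"step: (\d+)", log_text).group(1))
    epoch_ms = float(re.search(r"epoch time: ([\d.]+) ms", log_text).group(1))
    return steps, epoch_ms


steps, epoch_ms = parse_epoch(ASCEND_LOG)
per_step_ms = epoch_ms / steps               # should match the logged value
total_hours = 250 * epoch_ms / 3_600_000     # 250 = 'epoch_size'

# Assumption: ImageNet-1k train split, 8 devices, batch 128, remainder dropped.
assumed_steps = 1_281_167 // (128 * 8)

print(round(per_step_ms, 3), round(total_hours, 1), assumed_steps)
```

The extrapolated total of a little under 42 hours is in line with the 42h reported for Ascend in the training performance table.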
  180. ## [Eval process](#contents)
  181. ### Usage
  182. You can start evaluation using python or shell scripts. The shell scripts are used as follows:
  183. - Ascend:
  184. ```shell
  185. sh scripts/run_eval.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
  186. ```
  187. - GPU:
  188. ```shell
  189. sh scripts/run_eval_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT
  190. ```
  191. ### Launch
  192. ```shell
  193. # eval example
  194. # python:
  195. python eval.py --device_target Ascend --checkpoint_path PATH_CHECKPOINT --dataset_path DATA_DIR # Ascend
  196. python eval.py --device_target GPU --checkpoint_path PATH_CHECKPOINT --dataset_path DATA_DIR # GPU
  197. # shell:
  198. sh scripts/run_eval.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT # Ascend
  199. sh scripts/run_eval_gpu.sh DEVICE_ID DATA_DIR PATH_CHECKPOINT # GPU
  200. ```
  201. > Checkpoints are produced during the training process.
  202. ### Result
  203. Evaluation results are stored in the example path. You can find results like the following in `eval.log` on Ascend and `eval_gpu.log` on GPU.
  204. - Evaluating with ascend
  205. ```shell
  206. result: {'Loss': 1.7797744848789312, 'Top_1_Acc': 0.7985777243589743, 'Top_5_Acc': 0.9485777243589744}
  207. ```
  208. - Evaluating with gpu
  209. ```shell
  210. result: {'Loss': 1.7846775874590903, 'Top_1_Acc': 0.798735595390525, 'Top_5_Acc': 0.9498439500640204}
  211. ```
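`Top_1_Acc` and `Top_5_Acc` in the results above are top-k accuracies: a sample counts as correct if its true label is among the k highest-scoring classes. The sketch below shows that computation on a tiny made-up example with 3 classes. It is independent of MindSpore, and the function name `top_k_accuracy` and the toy scores are hypothetical.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted by descending score, truncated to the top k.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Toy scores for 4 samples and 3 classes (made up for this example).
scores = [
    [0.10, 0.70, 0.20],
    [0.50, 0.30, 0.20],
    [0.20, 0.30, 0.50],
    [0.40, 0.35, 0.25],
]
labels = [1, 1, 2, 2]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
print(top1, top2)
```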
  212. # [Model description](#contents)
  213. ## [Performance](#contents)
  214. ### Training Performance
  215. | Parameters | Ascend | GPU |
  216. | -------------------------- | ------------------------- | ------------------------- |
  217. | Model Version | Xception | Xception |
  218. | Resource | HUAWEI CLOUD Modelarts | HUAWEI CLOUD Modelarts |
  219. | Uploaded Date | 12/10/2020 | 02/09/2021 |
  220. | MindSpore Version | 1.1.0 | 1.1.0 |
  221. | Dataset | 1200k images | 1200k images |
  222. | Batch_size | 128 | 64 |
  223. | Training Parameters | src/config.py | src/config.py |
  224. | Optimizer | Momentum | Momentum |
  225. | Loss Function | CrossEntropySmooth | CrossEntropySmooth |
  226. | Loss | 1.78 | 1.78 |
  227. | Accuracy (8p) | Top1[79.8%] Top5[94.8%] | Top1[79.8%] Top5[94.9%] |
  228. | Per step time (8p) | 479 ms/step | 282 ms/step |
  229. | Total time (8p) | 42h | 51h |
  230. | Params (M) | 180M | 180M |
  231. | Scripts | [Xception script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/xception) | [Xception script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/cv/xception) |
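The table lists `CrossEntropySmooth` as the loss function, and the configuration uses a `label_smooth_factor` of 0.1. The sketch below shows label-smoothed cross entropy in plain Python for a single sample. It is an illustration, not the code in `src/loss.py`. In particular, spreading the smoothing mass over the other `num_classes - 1` classes is an assumption; another common convention spreads it over all classes.

```python
import math


def smoothed_targets(label, num_classes, smooth_factor):
    """Soft target distribution: 1 - smooth_factor on the label, rest shared."""
    off_value = smooth_factor / (num_classes - 1)  # assumed convention
    targets = [off_value] * num_classes
    targets[label] = 1.0 - smooth_factor
    return targets


def cross_entropy(logits, targets):
    """Cross entropy between softmax(logits) and a soft target distribution."""
    max_logit = max(logits)
    # Numerically stable log-sum-exp.
    log_sum_exp = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -sum(t * (x - log_sum_exp) for x, t in zip(logits, targets))


targets = smoothed_targets(label=2, num_classes=5, smooth_factor=0.1)
loss = cross_entropy([0.0] * 5, targets)  # uniform prediction
print(targets, loss)
```

With uniform logits the loss equals `log(num_classes)` regardless of the smoothing factor, which makes a convenient sanity check.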
  232. ### Inference Performance
  233. | Parameters | Ascend | GPU |
  234. | ------------------- | --------------------------- | --------------------------- |
  235. | Model Version | Xception | Xception |
  236. | Resource | HUAWEI CLOUD Modelarts | HUAWEI CLOUD Modelarts |
  237. | Uploaded Date | 12/10/2020 | 02/09/2021 |
  238. | MindSpore Version | 1.1.0 | 1.1.0 |
  239. | Dataset | 50k images | 50k images |
  240. | Batch_size | 128 | 64 |
  241. | Accuracy | Top1[79.8%] Top5[94.8%] | Top1[79.8%] Top5[94.9%] |
  242. | Total time | 3mins | 4.7mins |
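The total times in the table can be converted into an approximate evaluation throughput. The sketch below does that arithmetic using only the image count and times from the table; the helper name `images_per_second` is hypothetical, and the result is only as precise as the rounded times.

```python
def images_per_second(num_images, minutes):
    """Average throughput for an evaluation run of the given duration."""
    return num_images / (minutes * 60)


ascend = images_per_second(50_000, 3)    # 50k images in 3 min
gpu = images_per_second(50_000, 4.7)     # 50k images in 4.7 min
print(round(ascend), round(gpu))
```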
  243. # [Description of Random Situation](#contents)
  244. In `dataset.py`, we set the seed inside the `create_dataset` function. We also set a random seed in `train.py`.
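As a small illustration of why fixing a seed matters, the sketch below shuffles sample indices with Python's standard `random` module. It is a stand-in for the seeding done in `dataset.py` and `train.py`, not the MindSpore API, and the function name `shuffled_indices` is hypothetical.

```python
import random


def shuffled_indices(num_samples, seed):
    """Return sample indices shuffled by a generator with a fixed seed."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    indices = list(range(num_samples))
    rng.shuffle(indices)
    return indices


first = shuffled_indices(10, seed=1)
second = shuffled_indices(10, seed=1)
print(first == second)  # same seed, same order
```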
  245. # [ModelZoo Homepage](#contents)
  246. Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).