# Contents

- [GoogleNet Description](#googlenet-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Training](#training)
        - [Distributed Training](#distributed-training)
    - [Evaluation Process](#evaluation-process)
        - [Evaluation](#evaluation)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Evaluation Performance](#evaluation-performance)
        - [Inference Performance](#inference-performance)
    - [How to use](#how-to-use)
        - [Inference](#inference)
        - [Continue Training on the Pretrained Model](#continue-training-on-the-pretrained-model)
        - [Transfer Learning](#transfer-learning)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)

# [GoogleNet Description](#contents)

GoogleNet, a 22-layer deep network, was proposed in 2014 and won first place in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). GoogleNet, also called Inception v1, improves significantly on ZFNet (the 2013 winner) and AlexNet (the 2012 winner), and has a lower error rate than VGGNet. A deeper network typically has more parameters, which makes it more prone to overfitting, and a larger network also consumes more computational resources. To tackle these issues, GoogleNet uses 1×1 convolutions in the middle of the network to reduce dimensionality and therefore computation. Global average pooling is used at the end of the network instead of fully connected layers. Another technique, the inception module, applies convolutions of different sizes to the same input and stacks all of their outputs.

[Paper](https://arxiv.org/abs/1409.4842): Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich. "Going deeper with convolutions." *Proceedings of the IEEE conference on computer vision and pattern recognition*. 2015.

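
To make the effect of the 1×1 reduction concrete, the sketch below counts multiply-accumulate operations for a 5×5 branch with and without a preceding 1×1 convolution. The channel counts are those listed for the "inception (3a)" module in the paper (ImageNet-sized feature maps, not the CIFAR-10 configuration used in this repository), and the helper `conv_macs` is a name introduced here for illustration only.

```python
def conv_macs(height, width, kernel, in_channels, out_channels):
    """Multiply-accumulate count for a stride-1 convolution that keeps the spatial size."""
    return height * width * kernel * kernel * in_channels * out_channels

# 5x5 branch of "inception (3a)" in the paper: 28x28 feature map,
# 192 input channels, reduced to 16 by a 1x1 conv, then 32 output channels.
H, W, C_IN, C_REDUCED, C_OUT = 28, 28, 192, 16, 32

direct = conv_macs(H, W, 5, C_IN, C_OUT)
reduced = conv_macs(H, W, 1, C_IN, C_REDUCED) + conv_macs(H, W, 5, C_REDUCED, C_OUT)

print(f"direct 5x5:       {direct:,} MACs")    # 120,422,400
print(f"1x1 then 5x5:     {reduced:,} MACs")   # 12,443,648
print(f"reduction factor: {direct / reduced:.1f}x")
```

Under these assumptions the 1×1 bottleneck cuts the cost of this branch by roughly an order of magnitude.
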
# [Model Architecture](#contents)

The overall network architecture of GoogleNet is shown below:

![](https://miro.medium.com/max/3780/1*ZFPOSAted10TPd3hBQU8iQ.png)

Specifically, GoogleNet contains numerous inception modules that are chained together to build a deeper network. In general, an inception module with dimensionality reduction consists of **1×1 conv**, **3×3 conv**, **5×5 conv**, and **3×3 max pooling** branches, which are all applied to the same input from the previous layer and whose outputs are stacked together again at the output.

![](https://miro.medium.com/max/1108/1*sezFsYW1MyM9YOMa1q909A.png)

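
As a rough illustration of how the branch outputs are stacked, the sketch below sums the per-branch output channels of the paper's "inception (3a)" module. It only tracks shapes, not actual convolutions, and `inception_output_shape` is a hypothetical helper, not part of `src/googlenet.py`.

```python
# Output channels of each branch of "inception (3a)" as listed in the paper.
branches = {
    "1x1 conv": 64,
    "1x1 conv -> 3x3 conv": 128,
    "1x1 conv -> 5x5 conv": 32,
    "3x3 max pool -> 1x1 conv": 32,
}

def inception_output_shape(height, width, branch_channels):
    """Every branch keeps the spatial size, so stacking only adds up channels."""
    return (height, width, sum(branch_channels.values()))

print(inception_output_shape(28, 28, branches))  # (28, 28, 256)
```
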
# [Dataset](#contents)

Dataset used: [CIFAR-10](<http://www.cs.toronto.edu/~kriz/cifar.html>)

- Dataset size: 175M, 60,000 32×32 color images in 10 classes
    - Train: 146M, 50,000 images
    - Test: 29.3M, 10,000 images
- Data format: binary files
    - Note: Data will be processed in dataset.py

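
For orientation, the CIFAR-10 binary format stores each image as one label byte followed by 3,072 pixel bytes (the red, green, and blue planes of a 32×32 image). The sketch below splits such a buffer into records using only the standard library. It is not the loader in `dataset.py`; `parse_records` and the synthetic buffer are assumptions made for illustration.

```python
RECORD_BYTES = 1 + 3 * 32 * 32  # 1 label byte + 3072 pixel bytes

def parse_records(raw):
    """Split a CIFAR-10 binary buffer into (label, pixels) pairs."""
    if len(raw) % RECORD_BYTES != 0:
        raise ValueError("buffer length is not a multiple of the record size")
    records = []
    for start in range(0, len(raw), RECORD_BYTES):
        label = raw[start]
        pixels = raw[start + 1:start + RECORD_BYTES]  # R plane, then G, then B
        records.append((label, pixels))
    return records

# Synthetic two-record buffer standing in for a real binary batch file.
fake = bytes([3]) + bytes(3072) + bytes([7]) + bytes([255]) * 3072
parsed = parse_records(fake)
print([label for label, _ in parsed])  # [3, 7]

# The record size also accounts for the sizes quoted above (in MiB).
train_mib = round(50_000 * RECORD_BYTES / 2**20, 1)
test_mib = round(10_000 * RECORD_BYTES / 2**20, 1)
print(train_mib, test_mib)  # 146.5 29.3
```
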
# [Features](#contents)

## Mixed Precision

The [mixed precision](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/mixed_precision.html) training method accelerates deep neural network training by using both single-precision and half-precision data formats, while maintaining the accuracy achieved by single-precision training. It speeds up computation, reduces memory usage, and allows a larger model or batch size to be trained on given hardware.

For FP16 operators, if the input data type is FP32, the MindSpore backend automatically handles it with reduced precision. To see which operators ran with reduced precision, enable the INFO log and search for 'reduce precision'.

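
The trade-off between the two formats can be seen without any framework. The sketch below uses Python's `struct` module to round values to half and single precision. It illustrates the number formats only and says nothing about how MindSpore schedules its operators; `to_fp16` and `to_fp32` are helper names introduced here.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

weight, update = 1.0, 1e-4

# Storage per value: half precision needs half the bytes (2 vs 4).
print(struct.calcsize("e"), struct.calcsize("f"))

# A small update vanishes when the sum is rounded to FP16 ...
print(to_fp16(weight + update) == weight)   # True
# ... but survives when the sum is kept in FP32.
print(to_fp32(weight + update) > weight)    # True
```

This is one reason mixed precision keeps part of the computation in single precision rather than switching everything to FP16.
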
# [Environment Requirements](#contents)

- Hardware (Ascend/GPU)
    - Prepare a hardware environment with an Ascend or GPU processor. If you want to try Ascend, please send the [application form](https://obs-9be7.obs.cn-east-2.myhuaweicloud.com/file/other/Ascend%20Model%20Zoo%E4%BD%93%E9%AA%8C%E8%B5%84%E6%BA%90%E7%94%B3%E8%AF%B7%E8%A1%A8.docx) to ascend@huawei.com. Once approved, you can get the resources.
- Framework
    - [MindSpore](http://10.90.67.50/mindspore/archive/20200506/OpenSource/me_vm_x86/)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/zh-CN/master/index.html)
    - [MindSpore API](https://www.mindspore.cn/api/zh-CN/master/index.html)

# [Quick Start](#contents)

After installing MindSpore via the official website, you can start training and evaluation as follows:

```bash
# run training example
python train.py > train.log 2>&1 &

# run distributed training example
sh scripts/run_train.sh rank_table.json

# run evaluation example
python eval.py > eval.log 2>&1 &
# or
sh scripts/run_eval.sh
```

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```
├── model_zoo
    ├── README.md                    // descriptions about all the models
    ├── googlenet
        ├── README.md                // descriptions about googlenet
        ├── scripts
        │   ├── run_train.sh         // shell script for distributed training
        │   ├── run_eval.sh          // shell script for evaluation
        ├── src
        │   ├── dataset.py           // creating dataset
        │   ├── googlenet.py         // googlenet architecture
        │   ├── config.py            // parameter configuration
        ├── train.py                 // training script
        ├── eval.py                  // evaluation script
        ├── export.py                // export checkpoint files into geir/onnx
```

## [Script Parameters](#contents)

```
Major parameters in train.py and config.py are:

--data_path: The absolute full path to the train and evaluation datasets.
--epoch_size: Total training epochs.
--batch_size: Training batch size.
--lr_init: Initial learning rate.
--num_classes: The number of classes in the training set.
--weight_decay: Weight decay value.
--image_height: Image height used as input to the model.
--image_width: Image width used as input to the model.
--pre_trained: Whether to train from scratch or continue from a
               pre-trained model. Optional values are True, False.
--device_target: Device on which the code will run. Optional values
                 are "Ascend", "GPU".
--device_id: Device ID used to train or evaluate the dataset. Ignore it
             when you use run_train.sh for distributed training.
--checkpoint_path: The absolute full path to the checkpoint file saved
                   after training.
--onnx_filename: File name of the onnx model used in export.py.
--geir_filename: File name of the geir model used in export.py.
```

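
As a rough idea of how a few of these options could be exposed on the command line, here is a minimal `argparse` sketch. It is hypothetical: the repository may keep most values in `config.py` instead, `./data` is a placeholder path, and the other defaults are simply the values quoted in the performance table below.

```python
import argparse

def build_parser():
    """Hypothetical parser; defaults are placeholders, not the repository's."""
    parser = argparse.ArgumentParser(description="GoogleNet on CIFAR-10 (sketch)")
    parser.add_argument("--data_path", type=str, default="./data")
    parser.add_argument("--epoch_size", type=int, default=125)
    parser.add_argument("--batch_size", type=int, default=128)
    parser.add_argument("--lr_init", type=float, default=0.1)
    parser.add_argument("--num_classes", type=int, default=10)
    parser.add_argument("--device_target", choices=["Ascend", "GPU"], default="Ascend")
    parser.add_argument("--device_id", type=int, default=0)
    return parser

# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["--device_target", "GPU", "--batch_size", "64"])
print(args.device_target, args.batch_size, args.epoch_size)  # GPU 64 125
```
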
## [Training Process](#contents)

### Training

```
python train.py > train.log 2>&1 &
```

The python command above runs in the background; you can view the results in the file `train.log`.

After training, you'll get some checkpoint files under the script folder by default. The loss values are logged as follows:

```
# grep "loss is " train.log
epoch: 1 step: 390, loss is 1.4842823
epoch: 2 step: 390, loss is 1.0897788
...
```

The model checkpoint will be saved in the current directory.
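
If you would rather inspect the loss curve programmatically than with `grep`, a small parser along these lines may help. It assumes the log lines have exactly the `epoch: N step: M, loss is X` shape shown above; `parse_losses` is a name introduced here.

```python
import re

LOSS_LINE = re.compile(r"epoch:\s*(\d+)\s+step:\s*(\d+),\s*loss is\s*([0-9.eE+-]+)")

def parse_losses(lines):
    """Return (epoch, step, loss) tuples for every line that matches."""
    results = []
    for line in lines:
        match = LOSS_LINE.search(line)
        if match:
            results.append((int(match.group(1)), int(match.group(2)), float(match.group(3))))
    return results

sample = [
    "epoch: 1 step: 390, loss is 1.4842823",
    "epoch: 2 step: 390, loss is 1.0897788",
    "some unrelated log line",
]
losses = parse_losses(sample)
print(losses)  # [(1, 390, 1.4842823), (2, 390, 1.0897788)]
```

Because the pattern is applied with `search`, it also matches lines that carry a file-name prefix, such as the `grep` output from several log files.
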

### Distributed Training

```
sh scripts/run_train.sh rank_table.json
```

The shell script above runs distributed training in the background. You can view the results in the file `train_parallel[X]/log`. The loss values are logged as follows:

```
# grep "loss is " train_parallel*/log
train_parallel0/log:epoch: 1 step: 48, loss is 1.4302931
train_parallel0/log:epoch: 2 step: 48, loss is 1.4023874
...
train_parallel1/log:epoch: 1 step: 48, loss is 1.3458025
train_parallel1/log:epoch: 2 step: 48, loss is 1.3729336
...
...
```

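
The step counts in the two logs (390 on one device, 48 per device in the distributed run) are consistent with a simple calculation. The sketch assumes 8 devices, a per-device batch size of 128, even sharding of the 50,000 training images, and that the incomplete last batch is dropped; none of these is stated explicitly for the scripts.

```python
TRAIN_IMAGES = 50_000
BATCH_SIZE = 128

def steps_per_epoch(num_images, batch_size, num_devices=1):
    """Full batches each device sees per epoch, with data sharded evenly
    across devices and the incomplete last batch dropped (assumptions)."""
    return (num_images // num_devices) // batch_size

single = steps_per_epoch(TRAIN_IMAGES, BATCH_SIZE)
distributed = steps_per_epoch(TRAIN_IMAGES, BATCH_SIZE, num_devices=8)
print(single, distributed)  # 390 48
```
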
## [Evaluation Process](#contents)

### Evaluation

Before running the command below, please check the checkpoint path used for evaluation. Set the checkpoint path to the absolute full path, e.g., "username/googlenet/train_googlenet_cifar10-125_390.ckpt".

```
python eval.py > eval.log 2>&1 &
OR
sh scripts/run_eval.sh
```

The python command above runs in the background. You can view the results in the file "eval.log". The accuracy on the test dataset is reported as follows:

```
# grep "accuracy: " eval.log
accuracy: {'acc': 0.934}
```

Note that for evaluation after distributed training, set checkpoint_path to the last saved checkpoint file, such as "username/googlenet/train_parallel0/train_googlenet_cifar10-125_48.ckpt". The accuracy on the test dataset is reported as follows:

```
# grep "accuracy: " dist.eval.log
accuracy: {'acc': 0.9217}
```

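
For readers unfamiliar with the metric, the `acc` value is presumably top-1 accuracy: the fraction of test images whose highest-scoring class equals the label. A framework-free sketch of that computation, with made-up scores and a helper name chosen here, could look like this:

```python
def top1_accuracy(logits, labels):
    """Fraction of samples whose highest-scoring class equals the label."""
    correct = 0
    for scores, label in zip(logits, labels):
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += int(predicted == label)
    return correct / len(labels)

# Four toy samples with three classes each; the last one is misclassified.
logits = [
    [0.1, 2.0, -1.0],
    [1.5, 0.2, 0.3],
    [0.0, 0.1, 0.9],
    [0.4, 0.3, 0.2],
]
labels = [1, 0, 2, 1]
result = {"acc": top1_accuracy(logits, labels)}
print(result)  # {'acc': 0.75}
```
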
# [Model Description](#contents)

## [Performance](#contents)

### Evaluation Performance

| Parameters                 | GoogleNet                                                             |
| -------------------------- | --------------------------------------------------------------------- |
| Model Version              | Inception V1                                                          |
| Resource                   | Ascend 910; CPU 2.60GHz, 56 cores; Memory 314G                        |
| Uploaded Date              | 06/09/2020 (month/day/year)                                           |
| MindSpore Version          | 0.3.0-alpha                                                           |
| Dataset                    | CIFAR-10                                                              |
| Training Parameters        | epoch=125, steps=390, batch_size=128, lr=0.1                          |
| Optimizer                  | SGD                                                                   |
| Loss Function              | Softmax Cross Entropy                                                 |
| Outputs                    | probability                                                           |
| Loss                       | 0.0016                                                                |
| Speed                      | 1pc: 79 ms/step; 8pcs: 82 ms/step                                     |
| Total time                 | 1pc: 63.85 mins; 8pcs: 11.28 mins                                     |
| Parameters (M)             | 6.8                                                                   |
| Checkpoint for Fine tuning | 43.07M (.ckpt file)                                                   |
| Model for inference        | 21.50M (.onnx file), 21.60M (.geir file)                              |
| Scripts                    | https://gitee.com/mindspore/mindspore/tree/master/model_zoo/googlenet |

### Inference Performance

| Parameters          | GoogleNet                   |
| ------------------- | --------------------------- |
| Model Version       | Inception V1                |
| Resource            | Ascend 910                  |
| Uploaded Date       | 06/09/2020 (month/day/year) |
| MindSpore Version   | 0.3.0-alpha                 |
| Dataset             | CIFAR-10, 10,000 images     |
| batch_size          | 128                         |
| Outputs             | probability                 |
| Accuracy            | 1pc: 93.4%; 8pcs: 92.17%    |
| Model for inference | 21.50M (.onnx file)         |

## [How to use](#contents)

### Inference

If you need to use the trained model to perform inference on multiple hardware platforms, such as GPU, Ascend 910 or Ascend 310, you can refer to this [Link](https://www.mindspore.cn/tutorial/zh-CN/master/advanced_use/network_migration.html). A simple example follows:

```python
import mindspore.nn as nn
from mindspore.nn.optim.momentum import Momentum
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# cfg, create_dataset and GoogleNet are expected to come from
# src/config.py, src/dataset.py and src/googlenet.py.

# Load unseen dataset for inference
dataset = create_dataset(cfg.data_path, 1, False)

# Define model
net = GoogleNet(num_classes=cfg.num_classes)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()), 0.01,
               cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean',
                                        is_grad=False)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'})

# Load pre-trained model
param_dict = load_checkpoint(cfg.checkpoint_path)
load_param_into_net(net, param_dict)
net.set_train(False)

# Make predictions on the unseen dataset
acc = model.eval(dataset)
print("accuracy: ", acc)
```

### Continue Training on the Pretrained Model

```python
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.nn.optim.momentum import Momentum
from mindspore.train.model import Model
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor
from mindspore.train.serialization import load_checkpoint, load_param_into_net

# cfg, create_dataset, GoogleNet and lr_steps are expected to come from the
# repository's own scripts (src/config.py, src/dataset.py, src/googlenet.py, train.py).

# Load dataset
dataset = create_dataset(cfg.data_path, cfg.epoch_size)
batch_num = dataset.get_dataset_size()

# Define model
net = GoogleNet(num_classes=cfg.num_classes)
# Continue training if pre_trained is set to True
if cfg.pre_trained:
    param_dict = load_checkpoint(cfg.checkpoint_path)
    load_param_into_net(net, param_dict)
lr = lr_steps(0, lr_max=cfg.lr_init, total_epochs=cfg.epoch_size,
              steps_per_epoch=batch_num)
opt = Momentum(filter(lambda x: x.requires_grad, net.get_parameters()),
               Tensor(lr), cfg.momentum, weight_decay=cfg.weight_decay)
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean', is_grad=False)
model = Model(net, loss_fn=loss, optimizer=opt, metrics={'acc'},
              amp_level="O2", keep_batchnorm_fp32=False, loss_scale_manager=None)

# Set callbacks
config_ck = CheckpointConfig(save_checkpoint_steps=batch_num * 5,
                             keep_checkpoint_max=cfg.keep_checkpoint_max)
time_cb = TimeMonitor(data_size=batch_num)
ckpoint_cb = ModelCheckpoint(prefix="train_googlenet_cifar10", directory="./",
                             config=config_ck)
loss_cb = LossMonitor()

# Start training
model.train(cfg.epoch_size, dataset, callbacks=[time_cb, ckpoint_cb, loss_cb])
print("train success")
```

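
The snippet above calls `lr_steps` without showing it. Purely as an illustration of what a per-step schedule with that call signature could look like, here is a hypothetical piecewise-constant version. The decay boundaries (30%, 60%, and 80% of all steps) and the factor of 10 are assumptions, not taken from the repository.

```python
def lr_steps(global_step, lr_max=None, total_epochs=None, steps_per_epoch=None):
    """Hypothetical stand-in: one learning rate per remaining training step.

    The rate starts at lr_max and is divided by 10 after 30%, 60% and 80%
    of all steps. These boundaries are illustrative assumptions.
    """
    total_steps = total_epochs * steps_per_epoch
    boundaries = [0.3 * total_steps, 0.6 * total_steps, 0.8 * total_steps]
    schedule = []
    for step in range(total_steps):
        drops = sum(step >= b for b in boundaries)  # boundaries already passed
        schedule.append(lr_max * (0.1 ** drops))
    return schedule[global_step:]

# Small toy run: 10 epochs of 4 steps each.
lr = lr_steps(0, lr_max=0.1, total_epochs=10, steps_per_epoch=4)
print(len(lr), lr[0], round(lr[-1], 6))  # 40 0.1 0.0001
```

Returning a list with one entry per step matches the way the training snippet wraps the result in `Tensor(lr)` before handing it to the optimizer.
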

### Transfer Learning

To be added.

# [Description of Random Situation](#contents)

In dataset.py, we set the seed inside the `create_dataset` function. We also use a random seed in train.py.
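
The effect of fixing a seed can be shown with the standard library alone. The sketch below is generic Python and does not reproduce the seeding calls in `dataset.py` or `train.py`; `sample_order` is a hypothetical helper.

```python
import random

def sample_order(seed, n=10):
    """Shuffle indices 0..n-1 with a dedicated, seeded generator."""
    rng = random.Random(seed)
    order = list(range(n))
    rng.shuffle(order)
    return order

# The same seed always yields the same shuffling order.
print(sample_order(1) == sample_order(1))  # True
```

With a fixed seed, data shuffling (and anything else drawn from the same generator) is repeatable from run to run, which makes training results easier to compare.
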

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).