![](https://www.mindspore.cn/static/img/logo_black.6a5c850d.png)

# Contents

<!-- TOC -->

- [FastText](#fasttext)
- [Model Structure](#model-structure)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Dataset Preparation](#dataset-preparation)
    - [Configuration File](#configuration-file)
    - [Training Process](#training-process)
    - [Inference Process](#inference-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [Random Situation Description](#random-situation-description)
- [Others](#others)
- [ModelZoo HomePage](#modelzoo-homepage)

<!-- /TOC -->
# [FastText](#contents)

FastText is a fast text classification algorithm that is simple and efficient. It was proposed by Armand Joulin, Tomas Mikolov, et al. in the 2016 article "Bag of Tricks for Efficient Text Classification". Its model architecture is similar to CBOW, with the middle word replaced by a label. FastText adopts bag-of-n-gram features as additional features to capture partial information about local word order (see the sketch below). It speeds up training and testing while maintaining high accuracy, and is widely used in various text classification tasks.

[Paper](https://arxiv.org/pdf/1607.01759.pdf): "Bag of Tricks for Efficient Text Classification", 2016, A. Joulin, E. Grave, P. Bojanowski, and T. Mikolov
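The n-gram trick can be illustrated in a few lines of plain Python. This is a hedged sketch of the hashing idea described in the paper, not code from this repository; the function name, hash multiplier, and bucket count are placeholders.

```python
# Illustrative sketch of the paper's hashing trick for bigram features;
# names and constants here are placeholders, not this repository's code.
def add_bigram_ids(token_ids, vocab_size, num_buckets=2000000):
    """Append hashed bigram ids after the unigram ids."""
    bigram_ids = []
    for first, second in zip(token_ids, token_ids[1:]):
        # Hash each adjacent pair into a fixed number of buckets that
        # sit after the unigram vocabulary in the embedding table.
        bigram_ids.append(vocab_size + (first * 18313 + second) % num_buckets)
    return token_ids + bigram_ids

# A 4-token sentence yields 4 unigram ids plus 3 hashed bigram ids.
print(add_bigram_ids([4, 7, 2, 4], vocab_size=10))
```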
# [Model Structure](#contents)

The FastText model mainly consists of an input layer, a hidden layer, and an output layer, where the input is a sequence of words (a text or sentence). The output layer gives the probability that the word sequence belongs to each category. The hidden layer is formed by averaging the word vectors: the features are mapped to the hidden layer through a linear transformation, and the hidden layer is then mapped to the labels.
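As a reading aid, here is a minimal MindSpore sketch of this structure (embedding, sequence averaging, linear projection to class scores). It is a simplification under assumed parameter names; the repository's src/fasttext_model.py is the reference implementation.

```python
import mindspore.nn as nn
import mindspore.ops as ops

class FastTextSketch(nn.Cell):
    """Minimal FastText-style classifier: embed, average, project."""
    def __init__(self, vocab_size, embedding_dims, num_class):
        super(FastTextSketch, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dims)
        self.mean = ops.ReduceMean(keep_dims=False)
        self.fc = nn.Dense(embedding_dims, num_class)

    def construct(self, src_tokens):
        embedded = self.embedding(src_tokens)  # (batch, seq_len, dims)
        hidden = self.mean(embedded, 1)        # average over the sequence
        return self.fc(hidden)                 # (batch, num_class) scores
```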
# [Dataset](#contents)

Note that you can run the scripts with the datasets mentioned in the original paper or with datasets widely used in this domain/network architecture. The following sections describe how to run the scripts using the datasets below.

- AG's News Topic Classification Dataset
- DBPedia Ontology Classification Dataset
- Yelp Review Polarity Dataset
# [Environment Requirements](#contents)

- Hardware (Ascend)
    - Prepare a hardware environment with an Ascend processor.
- Framework
    - [MindSpore](https://gitee.com/mindspore/mindspore)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Quick Start](#contents)

After dataset preparation, you can start training and evaluation as follows:

```bash
# run training example
cd ./scripts
sh run_standalone_train.sh [TRAIN_DATASET] [DEVICEID]

# run distributed training example
sh run_distributed_train.sh [TRAIN_DATASET] [RANK_TABLE_PATH]

# run evaluation example
sh run_eval.sh [EVAL_DATASET_PATH] [DATASET_NAME] [MODEL_CKPT] [DEVICEID]
```
# [Script Description](#contents)

The FastText network scripts and code structure are as follows:

```text
├── fasttext
    ├── README.md                    // Introduction of the FastText model.
    ├── src
    │   ├── config.py                // Configuration instance definition.
    │   ├── create_dataset.py        // Dataset preparation.
    │   ├── fasttext_model.py        // FastText model architecture.
    │   ├── fasttext_train.py        // Training network built on the FastText model architecture.
    │   ├── load_dataset.py          // Dataset loader to feed into the model.
    │   ├── lr_scheduler.py          // Learning rate scheduler.
    ├── scripts
    │   ├── run_distributed_train.sh // Shell script for distributed training on Ascend.
    │   ├── run_eval.sh              // Shell script for standalone evaluation on Ascend.
    │   ├── run_standalone_train.sh  // Shell script for standalone training on Ascend.
    ├── eval.py                      // Inference API entry.
    ├── requirements.txt             // Requirements of third-party packages.
    ├── train.py                     // Training API entry.
```
## [Dataset Preparation](#contents)

- Download the AG's News Topic Classification Dataset, the DBPedia Ontology Classification Dataset, and the Yelp Review Polarity Dataset. Unzip the datasets to any path you want.
- Run the following commands to preprocess the data and convert the original data to mindrecord files for training and evaluation (see the sketch after this list).

```bash
cd scripts
sh create_dataset.sh [SOURCE_DATASET_PATH] [DATASET_NAME]
```
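For orientation, the sketch below shows how text samples are typically written to mindrecord with MindSpore's FileWriter. The schema, field names, and file name are assumptions made for illustration; src/create_dataset.py defines the actual conversion.

```python
import numpy as np
from mindspore.mindrecord import FileWriter

# Hypothetical schema and file name; the real fields are defined
# in src/create_dataset.py.
writer = FileWriter(file_name="ag_news.train.mindrecord", shard_num=1)
schema = {"src_tokens": {"type": "int32", "shape": [-1]},
          "label_idx": {"type": "int32", "shape": [-1]}}
writer.add_schema(schema, "fasttext_preprocess")
writer.write_raw_data([{"src_tokens": np.array([4, 7, 2], dtype=np.int32),
                        "label_idx": np.array([1], dtype=np.int32)}])
writer.commit()
```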
## [Configuration File](#contents)

Parameters for both training and evaluation can be set in config.py. All datasets use the same parameter names; the values can be changed according to your needs.

- Network Parameters (a configuration sketch follows the list below)

```text
vocab_size               # Vocabulary size.
buckets                  # Bucket sequence lengths.
test_buckets             # Test dataset bucket sequence lengths.
batch_size               # Batch size of the input dataset.
embedding_dims           # Size of each embedding vector.
num_class                # Number of labels.
epoch                    # Total training epochs.
lr                       # Initial learning rate.
min_lr                   # Minimum learning rate.
warmup_steps             # Warm-up steps.
poly_lr_scheduler_power  # Power used to compute the decayed learning rate.
pretrain_ckpt_dir        # Pretrained checkpoint directory.
keep_ckpt_max            # Maximum number of checkpoint files to keep.
```
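For reference, MindSpore model zoo configurations are commonly expressed as an EasyDict, roughly as sketched below. All values here are placeholders; consult src/config.py for the values actually used for each dataset.

```python
from easydict import EasyDict as ed

# Placeholder values for illustration only; src/config.py holds the
# per-dataset values actually used for training and evaluation.
config_example = ed({
    'vocab_size': 1000000,      # vocabulary size
    'buckets': [64, 128, 467],  # bucket sequence lengths
    'test_buckets': [467],      # test dataset bucket lengths
    'batch_size': 512,
    'embedding_dims': 16,
    'num_class': 4,
    'epoch': 5,
    'lr': 0.2,
    'min_lr': 1e-6,
    'warmup_steps': 400000,
    'poly_lr_scheduler_power': 0.5,
    'pretrain_ckpt_dir': None,
    'keep_ckpt_max': 10,
})
```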
## [Training Process](#contents)

- To start training on a single device, run the shell script:

```bash
cd ./scripts
sh run_standalone_train.sh [DATASET_PATH] [DEVICEID]
```
- To run distributed training of FastText on multiple devices, execute the following commands in `scripts/` (see the initialization sketch after this list):

```bash
cd ./scripts
sh run_distributed_train.sh [DATASET_PATH] [RANK_TABLE_PATH]
```
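Under the hood, distributed training on Ascend follows MindSpore's usual data-parallel setup, roughly as below. This is a generic sketch, not an excerpt from train.py; the device count and context options are assumptions.

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank

# Generic MindSpore data-parallel setup on Ascend (sketch only).
context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
init()  # uses the rank table exported by run_distributed_train.sh
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=8)  # assumed 8-device setup
rank_id = get_rank()  # used, for example, to shard the dataset per device
```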
## [Inference Process](#contents)

- To run evaluation of FastText, use the command below (a checkpoint-loading sketch follows):

```bash
cd ./scripts
sh run_eval.sh [DATASET_PATH] [DATASET_NAME] [MODEL_CKPT] [DEVICEID]
```

Note: `DATASET_PATH` is the path to the mindrecord files, e.g., /dataset_path/*.mindrecord
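Evaluation restores the trained weights from the checkpoint before running inference. A hedged sketch of that step, assuming `network` is a FastText instance built with the matching dataset config:

```python
from mindspore.train.serialization import load_checkpoint, load_param_into_net

def restore_for_eval(network, ckpt_path):
    """Load trained weights into `network` and switch it to inference mode."""
    param_dict = load_checkpoint(ckpt_path)   # read the .ckpt file
    load_param_into_net(network, param_dict)  # copy weights into the cell
    network.set_train(False)                  # disable training-only behavior
    return network
```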
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters               | Ascend |
| ------------------------ | ------ |
| Resource                 | Ascend 910 |
| Uploaded Date            | 12/21/2020 (month/day/year) |
| MindSpore Version        | 1.1.0 |
| Dataset                  | AG's News Topic Classification Dataset |
| Training Parameters      | epoch=5, batch_size=512 |
| Optimizer                | Adam |
| Loss Function            | Softmax Cross Entropy |
| Outputs                  | probability |
| Speed                    | 10ms/step (1pcs) |
| Epoch Time               | 2.36s (1pcs) |
| Loss                     | 0.0067 |
| Params (M)               | 22 |
| Checkpoint for inference | 254M (.ckpt file) |
| Scripts                  | [fasttext](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/fasttext) |

| Parameters               | Ascend |
| ------------------------ | ------ |
| Resource                 | Ascend 910 |
| Uploaded Date            | 11/21/2020 (month/day/year) |
| MindSpore Version        | 1.1.0 |
| Dataset                  | DBPedia Ontology Classification Dataset |
| Training Parameters      | epoch=5, batch_size=4096 |
| Optimizer                | Adam |
| Loss Function            | Softmax Cross Entropy |
| Outputs                  | probability |
| Speed                    | 58ms/step (1pcs) |
| Epoch Time               | 8.15s (1pcs) |
| Loss                     | 2.6e-4 |
| Params (M)               | 106 |
| Checkpoint for inference | 1.2G (.ckpt file) |
| Scripts                  | [fasttext](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/fasttext) |

| Parameters               | Ascend |
| ------------------------ | ------ |
| Resource                 | Ascend 910 |
| Uploaded Date            | 11/21/2020 (month/day/year) |
| MindSpore Version        | 1.1.0 |
| Dataset                  | Yelp Review Polarity Dataset |
| Training Parameters      | epoch=5, batch_size=2048 |
| Optimizer                | Adam |
| Loss Function            | Softmax Cross Entropy |
| Outputs                  | probability |
| Speed                    | 101ms/step (1pcs) |
| Epoch Time               | 28s (1pcs) |
| Loss                     | 0.062 |
| Params (M)               | 103 |
| Checkpoint for inference | 1.2G (.ckpt file) |
| Scripts                  | [fasttext](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/official/nlp/fasttext) |
### Inference Performance

| Parameters          | Ascend |
| ------------------- | ------ |
| Resource            | Ascend 910 |
| Uploaded Date       | 12/21/2020 (month/day/year) |
| MindSpore Version   | 1.1.0 |
| Dataset             | AG's News Topic Classification Dataset |
| batch_size          | 512 |
| Epoch Time          | 2.36s |
| Outputs             | label index |
| Accuracy            | 92.53% |
| Model for inference | 254M (.ckpt file) |

| Parameters          | Ascend |
| ------------------- | ------ |
| Resource            | Ascend 910 |
| Uploaded Date       | 12/21/2020 (month/day/year) |
| MindSpore Version   | 1.1.0 |
| Dataset             | DBPedia Ontology Classification Dataset |
| batch_size          | 4096 |
| Epoch Time          | 8.15s |
| Outputs             | label index |
| Accuracy            | 98.6% |
| Model for inference | 1.2G (.ckpt file) |

| Parameters          | Ascend |
| ------------------- | ------ |
| Resource            | Ascend 910 |
| Uploaded Date       | 12/21/2020 (month/day/year) |
| MindSpore Version   | 1.1.0 |
| Dataset             | Yelp Review Polarity Dataset |
| batch_size          | 2048 |
| Epoch Time          | 28s |
| Outputs             | label index |
| Accuracy            | 95.7% |
| Model for inference | 1.2G (.ckpt file) |
# [Random Situation Description](#contents)

There is only one source of randomness:

- Initialization of some model weights.

Seeds have already been set in train.py to avoid randomness in weight initialization.
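In MindSpore this is typically done with a single global seed call, roughly as follows (the seed value here is illustrative; train.py is authoritative):

```python
from mindspore.common import set_seed

set_seed(1)  # illustrative seed; see train.py for the value actually used
```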
# [Others](#contents)

This model has been validated in the Ascend environment and has not been validated on CPU or GPU.
# [ModelZoo HomePage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).