# Contents

- [DeepSpeech2 Description](#deepspeech2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training and Eval Process](#training-and-eval-process)
    - [Export MindIR](#export-mindir)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)

# [DeepSpeech2 Description](#contents)

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a diverse variety of speech, including noisy environments, accents, and different languages. We support training and evaluation on CPU and GPU.

[Paper](https://arxiv.org/pdf/1512.02595v1.pdf): Amodei, Dario, et al. "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin."

# [Model Architecture](#contents)

The current reproduced model consists of the following layers (a rough sketch follows this list):

- two convolutional layers:
    - number of channels is 32, kernel size is [41, 11], stride is [2, 2]
    - number of channels is 32, kernel size is [41, 11], stride is [2, 1]
- five bidirectional LSTM layers (size is 1024)
- one projection layer (size is number of characters plus 1 for the CTC blank symbol, i.e. 29)
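
To make the stack above concrete, the sketch below mirrors it with `mindspore.nn` building blocks. It is an illustrative skeleton only, not the actual implementation in `src/DeepSpeech.py`: the class name and the `rnn_input_size` parameter are placeholders introduced here, and the forward pass (which must reshape the convolutional features into a per-frame sequence before the LSTM and the projection) is omitted.

```python
import mindspore.nn as nn


class DeepSpeech2Sketch(nn.Cell):
    """Illustrative skeleton of the layer stack described above (not src/DeepSpeech.py)."""

    def __init__(self, rnn_input_size, num_classes=29, hidden_size=1024, num_rnn_layers=5):
        super(DeepSpeech2Sketch, self).__init__()
        # Two convolutional front-end layers, 32 output channels each.
        self.conv = nn.SequentialCell([
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(41, 11), stride=(2, 1)),
            nn.ReLU(),
        ])
        # Five bidirectional LSTM layers with hidden size 1024.
        # rnn_input_size is the flattened (channels x frequency) size of the
        # convolutional output and depends on the spectrogram settings.
        self.rnn = nn.LSTM(input_size=rnn_input_size,
                           hidden_size=hidden_size,
                           num_layers=num_rnn_layers,
                           batch_first=True,
                           bidirectional=True)
        # Projection layer: characters plus 1 for the CTC blank symbol (29 outputs).
        self.fc = nn.Dense(hidden_size * 2, num_classes)
```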

# [Dataset](#contents)

Note that you can run the scripts with the dataset mentioned in the original paper or one widely used in this domain/network architecture. In the following sections, we introduce how to run the scripts using the dataset below.

Dataset used: [LibriSpeech](<http://www.openslr.org/12>) (a manual download example follows this list)

- Train Data:
    - train-clean-100.tar.gz [6.3G] (training set of 100 hours "clean" speech)
    - train-clean-360.tar.gz [23G] (training set of 360 hours "clean" speech)
    - train-other-500.tar.gz [30G] (training set of 500 hours "other" speech)
- Val Data:
    - dev-clean.tar.gz [337M] (development set, "clean" speech)
    - dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
- Test Data:
    - test-clean.tar.gz [346M] (test set, "clean" speech)
    - test-other.tar.gz [328M] (test set, "other" speech)
- Data format: wav and txt files
    - Note: data will be processed in librispeech.py
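
The archives can also be fetched manually; the data preparation scripts referenced in the training section below can download them automatically, so this is only an illustration. The URL layout below is assumed from the OpenSLR resource page and may change, so prefer the links on the LibriSpeech page above.

```shell
# Manual download of one archive (URL layout assumed; verify on http://www.openslr.org/12).
wget http://www.openslr.org/resources/12/train-clean-100.tar.gz
# Extracts into a LibriSpeech/ directory.
tar -xzf train-clean-100.tar.gz
```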

# [Environment Requirements](#contents)

- Hardware (GPU)
    - Prepare hardware environment with GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```path
.
├── audio
    ├── deepspeech2
        ├── scripts
        │   ├── run_distribute_train_gpu.sh    // launch distributed training with gpu platform (8p)
        │   ├── run_eval_cpu.sh                // launch evaluation with cpu platform
        │   ├── run_eval_gpu.sh                // launch evaluation with gpu platform
        │   ├── run_standalone_train_cpu.sh    // launch standalone training with cpu platform
        │   └── run_standalone_train_gpu.sh    // launch standalone training with gpu platform (1p)
        ├── train.py                           // training scripts
        ├── eval.py                            // testing and evaluation outputs
        ├── export.py                          // convert mindspore model to mindir model
        ├── labels.json                        // possible characters to map to
        ├── README.md                          // descriptions about DeepSpeech
        ├── deepspeech_pytorch
        │   └── decoder.py                     // decoder from third party codes (MIT License)
        └── src
            ├── __init__.py
            ├── DeepSpeech.py                  // DeepSpeech networks
            ├── dataset.py                     // generate dataloader and data processing entry
            ├── config.py                      // DeepSpeech configs
            ├── lr_generator.py                // learning rate generator
            ├── greedydecoder.py               // modified greedydecoder for mindspore code
            └── callback.py                    // callbacks to monitor the training
```

## [Script Parameters](#contents)

### Training

```text
usage: train.py  [--use_pretrained USE_PRETRAINED]
                 [--pre_trained_model_path PRE_TRAINED_MODEL_PATH]
                 [--is_distributed IS_DISTRIBUTED]
                 [--bidirectional BIDIRECTIONAL]
                 [--device_target DEVICE_TARGET]

options:
    --pre_trained_model_path    pretrained checkpoint path, default is ''
    --is_distributed            distributed training, default is False
    --bidirectional             whether or not to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --device_target             device where the code will be implemented: "GPU" | "CPU", default is "GPU"
```
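
For illustration, a standalone run with the options above could be launched directly as shown below; the values are placeholders, and the shell scripts described in the training section are the recommended entry points.

```shell
# Example invocation illustrating the training options above (values are placeholders).
python train.py --device_target GPU --is_distributed False --bidirectional True
```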

### Evaluation

```text
usage: eval.py  [--bidirectional BIDIRECTIONAL]
                [--pretrain_ckpt PRETRAIN_CKPT]
                [--device_target DEVICE_TARGET]

options:
    --bidirectional     whether to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --pretrain_ckpt     saved checkpoint path, default is ''
    --device_target     device where the code will be implemented: "GPU" | "CPU", default is "GPU"
```
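
For illustration, an evaluation run with the options above could look like the following; the checkpoint path is a placeholder, and the `run_eval_*.sh` scripts described later are the recommended entry points.

```shell
# Example invocation illustrating the evaluation options above (checkpoint path is a placeholder).
python eval.py --pretrain_ckpt ./checkpoints/DeepSpeech.ckpt --device_target GPU
```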

### Options and Parameters

Parameters for training and evaluation can be set in file `config.py`.

```text
config for training.
    epochs    number of training epochs, default is 70
```

```text
config for dataloader.
    train_manifest          train manifest path, default is 'data/libri_train_manifest.csv'
    val_manifest            dev manifest path, default is 'data/libri_val_manifest.csv'
    batch_size              batch size for training, default is 8
    labels_path             tokens json path for model output, default is "./labels.json"
    sample_rate             sample rate for the data/model features, default is 16000
    window_size             window size for spectrogram generation (seconds), default is 0.02
    window_stride           window stride for spectrogram generation (seconds), default is 0.01
    window                  window type for spectrogram generation, default is 'hamming'
    speed_volume_perturb    use random tempo and gain perturbations, default is False, not used in current model
    spec_augment            use simple spectral augmentation on mel spectrograms, default is False, not used in current model
    noise_dir               directory to inject noise into audio. If default, noise injection is not added, default is '', not used in current model
    noise_prob              probability of noise being added per sample, default is 0.4, not used in current model
    noise_min               minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
    noise_max               maximum noise level to sample from, maximum is 1.0, default is 0.5, not used in current model
```

```text
config for model.
    rnn_type             type of RNN to use in model, default is 'LSTM'. Currently, only LSTM is supported
    hidden_size          hidden size of RNN layer, default is 1024
    hidden_layers        number of RNN layers, default is 5
    lookahead_context    lookahead context, default is 20, not used in current model
```

```text
config for optimizer.
    learning_rate      initial learning rate, default is 3e-4
    learning_anneal    annealing applied to learning rate after each epoch, default is 1.1
    weight_decay       weight decay, default is 1e-5
    momentum           momentum, default is 0.9
    eps                Adam eps, default is 1e-8
    betas              Adam betas, default is (0.9, 0.999)
    loss_scale         loss scale, default is 1024
```

```text
config for checkpoint.
    ckpt_file_name_prefix    ckpt file name prefix, default is 'DeepSpeech'
    ckpt_path                path to save ckpt, default is 'checkpoints'
    keep_checkpoint_max      max number of checkpoints to save, delete older checkpoints, default is 10
```

# [Training and Eval Process](#contents)

Before training, the dataset should be processed. We use the scripts provided by [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) to process the dataset. The script automatically downloads the dataset and processes it. After processing, the dataset directory structure is as follows:

```path
.
├── LibriSpeech_dataset
│   ├── train
│   │   ├── wav
│   │   └── txt
│   ├── val
│   │   ├── wav
│   │   └── txt
│   ├── test_clean
│   │   ├── wav
│   │   └── txt
│   └── test_other
│       ├── wav
│       └── txt
└── libri_test_clean_manifest.csv, libri_test_other_manifest.csv, libri_train_manifest.csv, libri_val_manifest.csv
```

The *.csv files store the absolute paths of the corresponding data. After obtaining the csv files, you should modify the configurations in `src/config.py`. For the training config, `train_manifest` should be set to the path of `libri_train_manifest.csv`; for the eval config, it should be set to `libri_test_clean_manifest.csv` or `libri_test_other_manifest.csv`, depending on which dataset is evaluated.

```shell
...
for training configuration
"DataConfig": {
    train_manifest: 'path_to_csv/libri_train_manifest.csv'
}

for evaluation configuration
"DataConfig": {
    train_manifest: 'path_to_csv/libri_test_clean_manifest.csv'
}
```

Before training, some requirements should be installed, including `librosa` and `Levenshtein`.
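
A typical installation looks like the following; the pip package name used here for the Levenshtein dependency (`python-Levenshtein`) is an assumption and may differ in your environment.

```shell
# Install the extra Python requirements (package names may differ slightly in your environment).
pip install librosa python-Levenshtein
```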

After installing MindSpore via the official website and finishing dataset processing, you can start training as follows:

```shell
# standalone training gpu
sh ./scripts/run_standalone_train_gpu.sh [DEVICE_ID]

# standalone training cpu
sh ./scripts/run_standalone_train_cpu.sh

# distributed training gpu
sh ./scripts/run_distribute_train_gpu.sh
```

The following script is used to evaluate the model. Note that only the greedy decoder is supported now. Before running the script, you should download the decoder code from [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) and place the `deepspeech_pytorch` directory into the `deepspeech2` directory. After that, the file directory will match the structure shown in [Script and Sample Code](#script-and-sample-code).

```shell
# eval on cpu
sh ./scripts/run_eval_cpu.sh [PATH_CHECKPOINT]

# eval on gpu
sh ./scripts/run_eval_gpu.sh [DEVICE_ID] [PATH_CHECKPOINT]
```

## [Export MindIR](#contents)

```bash
python export.py --pre_trained_model_path='ckpt_path'
```

# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters          | DeepSpeech                                                                                                   |
| ------------------- | ------------------------------------------------------------------------------------------------------------ |
| Resource            | NV SMX2 V100-32G                                                                                             |
| Uploaded Date       | 12/29/2020 (month/day/year)                                                                                  |
| MindSpore Version   | 1.0.0                                                                                                        |
| Dataset             | LibriSpeech                                                                                                  |
| Training Parameters | 2p, epoch=70, steps=5144 * epoch, batch_size=20, lr=3e-4                                                     |
| Optimizer           | Adam                                                                                                         |
| Loss Function       | CTCLoss                                                                                                      |
| Outputs             | probability                                                                                                  |
| Loss                | 0.2-0.7                                                                                                      |
| Speed               | 2p: 2.139 s/step                                                                                             |
| Total Training Time | 2p: around 1 week                                                                                            |
| Checkpoint          | 991M (.ckpt file)                                                                                            |
| Scripts             | [DeepSpeech script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/audio/deepspeech2) |

### Inference Performance

| Parameters            | DeepSpeech                                              |
| --------------------- | ------------------------------------------------------- |
| Resource              | NV SMX2 V100-32G                                        |
| Uploaded Date         | 12/29/2020 (month/day/year)                             |
| MindSpore Version     | 1.0.0                                                   |
| Dataset               | LibriSpeech                                             |
| batch_size            | 20                                                      |
| Outputs               | probability                                             |
| Accuracy (test-clean) | 2p: WER 9.902, CER 3.317; 8p: WER 11.593, CER 3.907     |
| Accuracy (test-other) | 2p: WER 28.693, CER 12.473; 8p: WER 31.397, CER 13.696  |
| Model for inference   | 330M (.mindir file)                                     |

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).