# Contents

- [DeepSpeech2 Description](#deepspeech2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
- [Training and Eval Process](#training-process)
    - [Export MindIR](#convert-process)
    - [Convert](#convert)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)

# [DeepSpeech2 Description](#contents)

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a wide variety of speech, including noisy environments, accents and different languages. We support training and evaluation on CPU and GPU.

[Paper](https://arxiv.org/pdf/1512.02595v1.pdf): Amodei, Dario, et al. "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin."

# [Model Architecture](#contents)

The current reproduced model consists of the following (a rough sketch follows the list):

- two convolutional layers:
    - number of channels is 32, kernel size is [41, 11], stride is [2, 2]
    - number of channels is 32, kernel size is [41, 11], stride is [2, 1]
- five bidirectional LSTM layers (size is 1024)
- one projection layer (size is the number of characters plus 1 for the CTC blank symbol, i.e. 29)

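For orientation, below is a rough, self-contained MindSpore sketch of this topology. It is illustrative only: the actual network lives in `src/DeepSpeech.py` and additionally handles sequence lengths, masking and batch normalization; the padding values and the 161 spectrogram bins are assumptions derived from the 16 kHz sample rate and 0.02 s window defaults.

```python
import numpy as np
import mindspore as ms
import mindspore.nn as nn


class DeepSpeech2Sketch(nn.Cell):
    """Illustrative topology only; src/DeepSpeech.py is the real implementation."""

    def __init__(self, num_classes=29, hidden_size=1024, num_rnn_layers=5, freq_bins=161):
        super().__init__()
        # Two 2-D convolutions over (frequency, time) spectrogram features.
        self.conv = nn.SequentialCell([
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2),
                      pad_mode='pad', padding=(20, 20, 5, 5)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(41, 11), stride=(2, 1),
                      pad_mode='pad', padding=(20, 20, 5, 5)),
            nn.ReLU(),
        ])
        # The frequency axis is halved by each stride-2 convolution.
        f = (freq_bins - 1) // 2 + 1
        f = (f - 1) // 2 + 1
        rnn_input_size = 32 * f
        self.hidden_size = hidden_size
        # Five bidirectional LSTM layers (hidden size 1024 in the paper setting).
        self.rnn = nn.LSTM(rnn_input_size, hidden_size, num_layers=num_rnn_layers,
                           batch_first=True, bidirectional=True)
        # Projection to characters plus the CTC blank (29 classes).
        self.fc = nn.Dense(2 * hidden_size, num_classes)

    def construct(self, spec, h0, c0):
        # spec: (batch, 1, freq_bins, time_steps)
        x = self.conv(spec)                               # (batch, 32, freq', time')
        b, c, f, t = x.shape
        x = x.reshape(b, c * f, t).transpose(0, 2, 1)     # (batch, time', features)
        x, _ = self.rnn(x, (h0, c0))                      # (batch, time', 2 * hidden)
        x = x.reshape(b * t, 2 * self.hidden_size)
        return self.fc(x).reshape(b, t, -1)               # per-frame class scores


# Quick shape check with reduced sizes (not the training configuration).
net = DeepSpeech2Sketch(hidden_size=64, num_rnn_layers=2)
spec = ms.Tensor(np.zeros((2, 1, 161, 50), np.float32))
h0 = ms.Tensor(np.zeros((2 * 2, 2, 64), np.float32))      # (layers * directions, batch, hidden)
c0 = ms.Tensor(np.zeros((2 * 2, 2, 64), np.float32))
print(net(spec, h0, c0).shape)                             # (2, time', 29)
```
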
# [Dataset](#contents)

Note that you can run the scripts with the dataset mentioned in the original paper or one widely used in the relevant domain/network architecture. In the following sections, we introduce how to run the scripts using the dataset below.

Dataset used: [LibriSpeech](<http://www.openslr.org/12>)

- Train Data:
    - train-clean-100.tar.gz [6.3G] (training set of 100 hours of "clean" speech)
    - train-clean-360.tar.gz [23G] (training set of 360 hours of "clean" speech)
    - train-other-500.tar.gz [30G] (training set of 500 hours of "other" speech)
- Val Data:
    - dev-clean.tar.gz [337M] (development set, "clean" speech)
    - dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
- Test Data:
    - test-clean.tar.gz [346M] (test set, "clean" speech)
    - test-other.tar.gz [328M] (test set, "other" speech)
- Data format: wav and txt files
    - Note: data will be processed in librispeech.py

# [Environment Requirements](#contents)

- Hardware (GPU)
    - Prepare a hardware environment with a GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```path
.
├── audio
    ├── deepspeech2
        ├── train.py                    // training script
        ├── eval.py                     // testing and evaluation outputs
        ├── export.py                   // convert mindspore model to mindir model
        ├── labels.json                 // possible characters to map to
        ├── README.md                   // descriptions about DeepSpeech
        ├── deepspeech_pytorch
        │   └── decoder.py              // decoder from third-party code (MIT License)
        └── src
            ├── __init__.py
            ├── DeepSpeech.py           // DeepSpeech networks
            ├── dataset.py              // generate dataloader and data processing entry
            ├── config.py               // DeepSpeech configs
            ├── lr_generator.py         // learning rate generator
            ├── greedydecoder.py        // modified greedy decoder for MindSpore code
            └── callback.py             // callbacks to monitor the training
```
## [Script Parameters](#contents)

### Training

```text
usage: train.py [--use_pretrained USE_PRETRAINED]
                [--pre_trained_model_path PRE_TRAINED_MODEL_PATH]
                [--is_distributed IS_DISTRIBUTED]
                [--bidirectional BIDIRECTIONAL]
                [--device_target DEVICE_TARGET]

options:
    --pre_trained_model_path    pretrained checkpoint path, default is ''
    --is_distributed            distributed training, default is False
    --bidirectional             whether or not to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --device_target             device where the code will be run: "GPU" | "CPU", default is "GPU"
```
### Evaluation

```text
usage: eval.py [--bidirectional BIDIRECTIONAL]
               [--pretrain_ckpt PRETRAIN_CKPT]
               [--device_target DEVICE_TARGET]

options:
    --bidirectional     whether to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --pretrain_ckpt     saved checkpoint path, default is ''
    --device_target     device where the code will be run: "GPU" | "CPU", default is "GPU"
```
### Options and Parameters

Parameters for training and evaluation can be set in file `config.py`.

```text
config for training.
    epochs                  number of training epochs, default is 70
```
```text
config for dataloader.
    train_manifest          train manifest path, default is 'data/libri_train_manifest.csv'
    val_manifest            dev manifest path, default is 'data/libri_val_manifest.csv'
    batch_size              batch size for training, default is 8
    labels_path             tokens json path for model output, default is "./labels.json"
    sample_rate             sample rate for the data/model features, default is 16000
    window_size             window size for spectrogram generation (seconds), default is 0.02
    window_stride           window stride for spectrogram generation (seconds), default is 0.01
    window                  window type for spectrogram generation, default is 'hamming'
    speed_volume_perturb    use random tempo and gain perturbations, default is False, not used in current model
    spec_augment            use simple spectral augmentation on mel spectrograms, default is False, not used in current model
    noise_dir               directory to inject noise into audio. If default, noise injection is not added, default is '', not used in current model
    noise_prob              probability of noise being added per sample, default is 0.4, not used in current model
    noise_min               minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
    noise_max               maximum noise level to sample from, maximum 1.0, default is 0.5, not used in current model
```
```text
config for model.
    rnn_type            type of RNN to use in model, default is 'LSTM'. Currently, only LSTM is supported
    hidden_size         hidden size of RNN layer, default is 1024
    hidden_layers       number of RNN layers, default is 5
    lookahead_context   look ahead context, default is 20, not used in current model
```
```text
config for optimizer.
    learning_rate       initial learning rate, default is 3e-4
    learning_anneal     annealing applied to learning rate after each epoch, default is 1.1
    weight_decay        weight decay, default is 1e-5
    momentum            momentum, default is 0.9
    eps                 Adam eps, default is 1e-8
    betas               Adam betas, default is (0.9, 0.999)
    loss_scale          loss scale, default is 1024
```
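The `learning_rate` and `learning_anneal` settings imply an exponentially decaying schedule: following the common DeepSpeech2 convention, the learning rate is divided by `learning_anneal` after every epoch. A rough sketch of how such a per-step schedule can be generated (illustrative only; `src/lr_generator.py` is the authoritative implementation):

```python
def get_annealed_lr(base_lr, learning_anneal, total_epochs, steps_per_epoch):
    """Return one learning-rate value per training step, annealed once per epoch."""
    lr_each_step = []
    for epoch in range(total_epochs):
        lr = base_lr / (learning_anneal ** epoch)
        lr_each_step.extend([lr] * steps_per_epoch)
    return lr_each_step

# Defaults from the optimizer config above: 3e-4 initial LR, anneal factor 1.1.
schedule = get_annealed_lr(3e-4, 1.1, total_epochs=70, steps_per_epoch=5144)
print(schedule[0], schedule[-1])   # ~3.0e-4 at the start, ~4.2e-7 in the final epoch
```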
```text
config for checkpoint.
    ckpt_file_name_prefix   prefix of the checkpoint file name, default is 'DeepSpeech'
    ckpt_path               path to save checkpoints, default is 'checkpoints'
    keep_checkpoint_max     max number of checkpoints to keep (older checkpoints are deleted), default is 10
```
# [Training and Eval Process](#contents)

Before training, the dataset should be processed. We use the data-processing scripts provided by [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch), which automatically download the dataset and process it. After processing, the dataset directory structure is as follows:

```path
.
├── LibriSpeech_dataset
│   ├── train
│   │   ├── wav
│   │   └── txt
│   ├── val
│   │   ├── wav
│   │   └── txt
│   ├── test_clean
│   │   ├── wav
│   │   └── txt
│   └── test_other
│       ├── wav
│       └── txt
└── libri_test_clean_manifest.csv, libri_test_other_manifest.csv, libri_train_manifest.csv, libri_val_manifest.csv
```
The *.csv files store the absolute paths of the corresponding data. After obtaining the csv files, you should modify the configurations in `src/config.py`. For the training config, `train_manifest` should be set to the path of `libri_train_manifest.csv`; for the eval config, it should be set to `libri_test_clean_manifest.csv` or `libri_test_other_manifest.csv`, depending on which dataset is evaluated.

```text
...
# training configuration
"DataConfig": {
    train_manifest: 'path_to_csv/libri_train_manifest.csv'
}

# evaluation configuration
"DataConfig": {
    train_manifest: 'path_to_csv/libri_test_clean_manifest.csv'
}
```
Before training, some requirements should be installed, including `librosa` and `Levenshtein`.
After installing MindSpore via the official website and finishing dataset processing, you can start training as follows:

```shell
# standalone training
CUDA_VISIBLE_DEVICES='0' python train.py

# distributed training
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' mpirun --allow-run-as-root -n 8 python train.py --is_distributed > log 2>&1 &
```
The following script is used to evaluate the model. Note that only the greedy decoder is supported at the moment (a short sketch of the idea follows the command below). Before running the script, you should download the decoder code from [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) and place the `deepspeech_pytorch` directory inside the `deepspeech2` directory. After that, the directory layout matches the one shown in [Script and Sample Code](#script-and-sample-code).

```shell
# eval
CUDA_VISIBLE_DEVICES='0' python eval.py --pretrain_ckpt='saved_model_path'
```
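For reference, greedy (best-path) CTC decoding takes the per-frame argmax over the output distribution, collapses repeated symbols, and removes the CTC blank. A minimal, self-contained sketch of the idea (illustrative only; `eval.py` uses the repo's `greedydecoder.py`, adapted from the SeanNaren decoder, and the blank's position depends on `labels.json`):

```python
import numpy as np

def greedy_ctc_decode(scores, labels, blank_index=0):
    """Best-path CTC decoding.

    scores: (time_steps, num_classes) array of per-frame class scores.
    labels: list of output characters, e.g. loaded from labels.json.
    blank_index: index of the CTC blank symbol in `labels`.
    """
    best_path = np.argmax(scores, axis=1)
    decoded = []
    prev = blank_index
    for idx in best_path:
        # Collapse repeats, then drop blanks.
        if idx != prev and idx != blank_index:
            decoded.append(labels[idx])
        prev = idx
    return "".join(decoded)

# Toy example with a 3-symbol alphabet ('a', 'b', blank at index 2).
scores = np.array([[0.9, 0.0, 0.1],   # 'a'
                   [0.8, 0.1, 0.1],   # 'a' (repeat, collapsed)
                   [0.1, 0.1, 0.8],   # blank
                   [0.1, 0.8, 0.1]])  # 'b'
print(greedy_ctc_decode(scores, ["a", "b", "_"], blank_index=2))  # -> "ab"
```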
## [Export MindIR](#contents)

```bash
python export.py --pre_trained_model_path='ckpt_path'
```
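`export.py` loads the trained checkpoint into the network and exports a MindIR graph. A minimal sketch of the typical MindSpore export flow (the stand-in network, dummy input shape and file name are illustrative; see the repo's `export.py` for the real logic):

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor, export, load_checkpoint, load_param_into_net

# Stand-in network; in the real export.py this is the DeepSpeech2 model from src/DeepSpeech.py.
net = nn.Dense(161, 29)

# Load trained weights into the network (uncomment with a real checkpoint path).
# param_dict = load_checkpoint("ckpt_path")
# load_param_into_net(net, param_dict)

# A dummy input with the expected shape drives graph tracing during export.
dummy_input = Tensor(np.zeros((1, 161), np.float32))
export(net, dummy_input, file_name="deepspeech2", file_format="MINDIR")
```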
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | DeepSpeech                                                      |
| -------------------------- | --------------------------------------------------------------- |
| Resource                   | NV SMX2 V100-32G                                                |
| Uploaded Date              | 12/29/2020 (month/day/year)                                     |
| MindSpore Version          | 1.0.0                                                           |
| Dataset                    | LibriSpeech                                                     |
| Training Parameters        | 2p, epoch=70, steps=5144 * epoch, batch_size = 20, lr=3e-4      |
| Optimizer                  | Adam                                                            |
| Loss Function              | CTCLoss                                                         |
| outputs                    | probability                                                     |
| Loss                       | 0.2-0.7                                                         |
| Speed                      | 2p: 2.139 s/step                                                |
| Total time                 | 2p: around 1 week                                               |
| Checkpoint                 | 991M (.ckpt file)                                               |
| Scripts                    | [DeepSpeech script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/audio/deepspeech2) |

### Inference Performance

| Parameters             | DeepSpeech                                               |
| ---------------------- | -------------------------------------------------------- |
| Resource               | NV SMX2 V100-32G                                         |
| Uploaded Date          | 12/29/2020 (month/day/year)                              |
| MindSpore Version      | 1.0.0                                                    |
| Dataset                | LibriSpeech                                              |
| batch_size             | 20                                                       |
| outputs                | probability                                              |
| Accuracy (test-clean)  | 2p: WER 9.902, CER 3.317; 8p: WER 11.593, CER 3.907      |
| Accuracy (test-other)  | 2p: WER 28.693, CER 12.473; 8p: WER 31.397, CER 13.696   |
| Model for inference    | 330M (.mindir file)                                      |

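WER and CER in the table above are edit-distance rates: the Levenshtein distance between hypothesis and reference at the word and character level respectively, divided by the reference length (reported as percentages). A minimal sketch of the computation (illustrative only; the repo's `eval.py` relies on the `Levenshtein` package mentioned earlier and may normalize text differently):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the cat sit"))   # 1 substituted word / 3 words ≈ 0.333
print(cer("the cat sat", "the cat sit"))   # 1 substituted char / 11 chars ≈ 0.091
```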
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).