# Contents

- [DeepSpeech2 Description](#deepspeech2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
- [Script and Sample Code](#script-and-sample-code)
- [Script Parameters](#script-parameters)
- [Training and Eval Process](#training-and-eval-process)
- [Export MindIR](#export-mindir)
- [Convert](#convert)
- [Model Description](#model-description)
- [Performance](#performance)
- [Training Performance](#training-performance)
- [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)

# [DeepSpeech2 Description](#contents)

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a wide variety of speech, including noisy environments, accents and different languages. We support training and evaluation on GPU.

[Paper](https://arxiv.org/pdf/1512.02595v1.pdf): Amodei, Dario, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin.

# [Model Architecture](#contents)

The current reproduced model consists of:

- two convolutional layers:
    - number of channels is 32, kernel size is [41, 11], stride is [2, 2]
    - number of channels is 32, kernel size is [41, 11], stride is [2, 1]
- five bidirectional LSTM layers (hidden size is 1024)
- one projection layer (output size is the number of characters plus 1 for the CTC blank symbol, i.e. 29)

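The layer list above corresponds to a network roughly like the sketch below. This is an illustrative sketch written against the MindSpore `nn` API, not the actual `src/DeepSpeech.py`; the padding mode, the assumed 161 input frequency bins, and the reshaping details are assumptions.

```python
# Illustrative sketch of the layer stack described above; not the repository's src/DeepSpeech.py.
import mindspore.nn as nn

class DeepSpeech2Sketch(nn.Cell):
    """Two conv layers -> five bidirectional LSTM layers -> one projection layer."""

    def __init__(self, num_classes=29, hidden_size=1024):
        super().__init__()
        self.num_classes = num_classes
        # Two 2-D convolutions over the (frequency, time) spectrogram input.
        self.conv = nn.SequentialCell([
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), pad_mode='same'),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(41, 11), stride=(2, 1), pad_mode='same'),
            nn.ReLU(),
        ])
        # Five bidirectional LSTM layers. The RNN input size assumes 161 frequency bins
        # halved twice by the stride-2 convolutions: 32 channels * 41 bins.
        rnn_input_size = 32 * 41
        self.rnn = nn.LSTM(rnn_input_size, hidden_size, num_layers=5, bidirectional=True)
        # Projection to the characters plus 1 CTC blank symbol (29 classes in total).
        self.fc = nn.Dense(hidden_size * 2, num_classes)

    def construct(self, spectrogram):
        x = self.conv(spectrogram)                    # (batch, 32, freq', time')
        b, c, f, t = x.shape
        x = x.view(b, c * f, t).transpose(2, 0, 1)    # (time', batch, features)
        x, _ = self.rnn(x)                            # initial states default to zeros
        t2, b2, h = x.shape
        logits = self.fc(x.view(t2 * b2, h))          # per-frame character scores
        return logits.view(t2, b2, self.num_classes)
```
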
# [Dataset](#contents)

Note that you can run the scripts with the dataset mentioned in the original paper or with other datasets widely used in this domain. The following sections describe how to run the scripts with the dataset below.

Dataset used: [LibriSpeech](<http://www.openslr.org/12>)

- Train Data:
    - train-clean-100.tar.gz [6.3G] (training set of 100 hours "clean" speech)
    - train-clean-360.tar.gz [23G] (training set of 360 hours "clean" speech)
    - train-other-500.tar.gz [30G] (training set of 500 hours "other" speech)
- Val Data:
    - dev-clean.tar.gz [337M] (development set, "clean" speech)
    - dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
- Test Data:
    - test-clean.tar.gz [346M] (test set, "clean" speech)
    - test-other.tar.gz [328M] (test set, "other" speech)
- Data format: wav and txt files
    - Note: data will be processed in librispeech.py

# [Environment Requirements](#contents)

- Hardware (GPU)
    - Prepare hardware environment with a GPU processor.
- Framework
    - [MindSpore](https://cmc-szv.clouddragon.huawei.com/cmcversion/index/search?searchKey=Do-MindSpore%20V100R001C00B622)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```path
.
├── audio
    ├── deepspeech2
        ├── train.py                  // training script
        ├── eval.py                   // evaluation script
        ├── export.py                 // convert the MindSpore model to a MindIR model
        ├── labels.json               // possible characters to map to
        ├── README.md                 // descriptions about DeepSpeech2
        ├── deepspeech_pytorch
        │   └── decoder.py            // decoder from third-party code (MIT License)
        └── src
            ├── __init__.py
            ├── DeepSpeech.py         // DeepSpeech2 network
            ├── dataset.py            // data loader generation and data processing entry
            ├── config.py             // DeepSpeech2 configs
            ├── lr_generator.py       // learning rate generator
            ├── greedydecoder.py      // greedy decoder modified for MindSpore
            └── callback.py           // callbacks to monitor the training
```

## [Script Parameters](#contents)

### Training

```text
usage: train.py [--use_pretrained USE_PRETRAINED]
                [--pre_trained_model_path PRE_TRAINED_MODEL_PATH]
                [--is_distributed IS_DISTRIBUTED]
                [--bidirectional BIDIRECTIONAL]

options:
    --pre_trained_model_path    pretrained checkpoint path, default is ''
    --is_distributed            distributed training, default is False
    --bidirectional             whether or not to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
```

### Evaluation

```text
usage: eval.py [--bidirectional BIDIRECTIONAL]
               [--pretrain_ckpt PRETRAIN_CKPT]

options:
    --bidirectional    whether to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --pretrain_ckpt    saved checkpoint path, default is ''
```

### Options and Parameters

Parameters for training and evaluation can be set in file `config.py`.

```text
config for training.
    epochs    number of training epochs, default is 70
```

```text
config for dataloader.
    train_manifest          train manifest path, default is 'data/libri_train_manifest.csv'
    val_manifest            dev manifest path, default is 'data/libri_val_manifest.csv'
    batch_size              batch size for training, default is 8
    labels_path             tokens json path for model output, default is "./labels.json"
    sample_rate             sample rate for the data/model features, default is 16000
    window_size             window size for spectrogram generation (seconds), default is 0.02
    window_stride           window stride for spectrogram generation (seconds), default is 0.01
    window                  window type for spectrogram generation, default is 'hamming'
    speed_volume_perturb    use random tempo and gain perturbations, default is False, not used in current model
    spec_augment            use simple spectral augmentation on mel spectrograms, default is False, not used in current model
    noise_dir               directory to inject noise into audio. If default, noise injection is not added, default is '', not used in current model
    noise_prob              probability of noise being added per sample, default is 0.4, not used in current model
    noise_min               minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
    noise_max               maximum noise level to sample from, maximum is 1.0, default is 0.5, not used in current model
```

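As a sanity check on the spectrogram settings above, the window size and stride in seconds translate directly into frame lengths in samples at the given sample rate. A small illustrative calculation (plain Python, not code from the repository):

```python
# Illustrative only: convert the spectrogram settings above into frame sizes in samples.
sample_rate = 16000       # Hz, default
window_size = 0.02        # seconds, default
window_stride = 0.01      # seconds, default

n_fft = int(sample_rate * window_size)         # 320 samples per analysis window
hop_length = int(sample_rate * window_stride)  # 160 samples between adjacent windows

print(n_fft, hop_length)  # 320 160 -> a 1-second utterance yields roughly 100 frames
```
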
```text
config for model.
    rnn_type             type of RNN to use in the model, default is 'LSTM'. Currently, only LSTM is supported
    hidden_size          hidden size of the RNN layer, default is 1024
    hidden_layers        number of RNN layers, default is 5
    lookahead_context    lookahead context, default is 20, not used in current model
```

```text
config for optimizer.
    learning_rate      initial learning rate, default is 3e-4
    learning_anneal    annealing applied to the learning rate after each epoch, default is 1.1
    weight_decay       weight decay, default is 1e-5
    momentum           momentum, default is 0.9
    eps                Adam eps, default is 1e-8
    betas              Adam betas, default is (0.9, 0.999)
    loss_scale         loss scale, default is 1024
```

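The `learning_anneal` factor means the learning rate is divided by 1.1 after every epoch. Below is a short sketch of how such a per-epoch schedule could be generated; it is illustrative only and may differ from the actual `src/lr_generator.py` (which may, for example, produce a per-step list).

```python
# Illustrative sketch of an annealed learning-rate schedule; not the repository's lr_generator.py.
def annealed_lr(initial_lr=3e-4, anneal=1.1, epochs=70):
    """Return one learning rate per epoch, dividing by `anneal` after each epoch."""
    lrs = []
    lr = initial_lr
    for _ in range(epochs):
        lrs.append(lr)
        lr /= anneal
    return lrs

schedule = annealed_lr()
print(schedule[0], schedule[1])  # 3e-4 for the first epoch, ~2.73e-4 for the second, ...
```
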
```text
config for checkpoint.
    ckpt_file_name_prefix    prefix of the checkpoint file name, default is 'DeepSpeech'
    ckpt_path                path to save checkpoints, default is 'checkpoints'
    keep_checkpoint_max      max number of checkpoints to keep; older checkpoints are deleted, default is 10
```

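The groups above suggest how the settings in `config.py` might be organized. The sketch below is a hypothetical layout using plain dataclasses; the real file may use a different container (for example an easydict), so the names and structure are assumptions.

```python
# Hypothetical layout of the settings documented above; not the repository's config.py.
from dataclasses import dataclass, field

@dataclass
class TrainConfig:
    epochs: int = 70

@dataclass
class OptimizerConfig:
    learning_rate: float = 3e-4
    learning_anneal: float = 1.1
    weight_decay: float = 1e-5
    momentum: float = 0.9
    eps: float = 1e-8
    betas: tuple = (0.9, 0.999)
    loss_scale: float = 1024

@dataclass
class CheckpointConfig:
    ckpt_file_name_prefix: str = 'DeepSpeech'
    ckpt_path: str = 'checkpoints'
    keep_checkpoint_max: int = 10

@dataclass
class Config:
    train: TrainConfig = field(default_factory=TrainConfig)
    optimizer: OptimizerConfig = field(default_factory=OptimizerConfig)
    checkpoint: CheckpointConfig = field(default_factory=CheckpointConfig)

config = Config()
print(config.optimizer.learning_rate)  # 0.0003
```
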
# [Training and Eval Process](#contents)

Before training, the dataset should be processed. We use the scripts provided by [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) to process the dataset.
The script from [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) automatically downloads the dataset and processes it. After processing, the dataset directory structure is as follows:

```path
.
├─ LibriSpeech_dataset
│  ├── train
│  │   ├─ wav
│  │   └─ txt
│  ├── val
│  │   ├─ wav
│  │   └─ txt
│  ├── test_clean
│  │   ├─ wav
│  │   └─ txt
│  └── test_other
│      ├─ wav
│      └─ txt
└─ libri_test_clean_manifest.csv, libri_test_other_manifest.csv, libri_train_manifest.csv, libri_val_manifest.csv
```

The four *.csv files store the absolute paths of the corresponding data and are used in the training and evaluation process.

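The exact column layout of the manifests is determined by the SeanNaren preprocessing scripts; assuming each row pairs an audio path with its transcript path, a quick spot check of one of them might look like the snippet below (illustrative only; adjust the column indices to the actual files).

```python
# Illustrative manifest check; the two-column layout is an assumption, verify against the real CSV.
import csv

with open('libri_train_manifest.csv', newline='') as f:
    for i, row in enumerate(csv.reader(f)):
        wav_path, txt_path = row[0], row[1]   # assumed: audio path, transcript path
        print(wav_path, txt_path)
        if i >= 4:                            # print only the first few rows
            break
```
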
After installing MindSpore via the official website and finishing dataset processing, you can start training as follows:

```shell
# standalone training
CUDA_VISIBLE_DEVICES='0' python train.py

# distributed training
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' mpirun --allow-run-as-root -n 8 python train.py --is_distributed=True > log 2>&1 &
```

The following script is used to evaluate the model. Note that we only support the greedy decoder now. Before running the script,
you should download the decoder code from [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) and place
deepspeech_pytorch into the deepspeech2 directory. After that, the file directory will be as shown in [Script and Sample Code](#script-and-sample-code).

```shell
# eval
CUDA_VISIBLE_DEVICES='0' python eval.py --pretrain_ckpt='saved_model_path'
```

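Greedy CTC decoding simply takes the most likely character at each frame, collapses repeated characters, and removes the blank symbol. A minimal, self-contained illustration of that idea (not the decoder shipped with deepspeech.pytorch):

```python
# Minimal illustration of greedy CTC decoding; not the decoder used by eval.py.
import numpy as np

def greedy_ctc_decode(log_probs, labels, blank_index=0):
    """log_probs: (time, num_classes) per-frame scores; labels: index -> character."""
    best = np.argmax(log_probs, axis=1)          # most likely class per frame
    decoded = []
    prev = None
    for idx in best:
        if idx != prev and idx != blank_index:   # collapse repeats, drop blanks
            decoded.append(labels[idx])
        prev = idx
    return ''.join(decoded)

# Toy example: 3 classes (blank, 'a', 'b') over 5 frames.
scores = np.log([[0.1, 0.8, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.8, 0.1, 0.1],
                 [0.1, 0.1, 0.8],
                 [0.1, 0.1, 0.8]])
print(greedy_ctc_decode(scores, labels={1: 'a', 2: 'b'}))  # -> "ab"
```
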
## [Export MindIR](#contents)

```bash
python export.py --pre_trained_model_path='ckpt_path'
```

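Under the hood, exporting to MindIR typically means loading a checkpoint into the network and calling MindSpore's `export` function. The sketch below shows the general pattern; the tiny placeholder network and input shape stand in for the real DeepSpeech2 network and inputs defined in `export.py`.

```python
# General MindSpore export pattern; the placeholder network and input shape are assumptions.
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor, export

class TinyNet(nn.Cell):
    """Placeholder network so the export call below is runnable as written."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Dense(161, 29)

    def construct(self, x):
        return self.fc(x)

net = TinyNet()
# For the real model, first load the trained weights, e.g. with
# mindspore.load_checkpoint('ckpt_path') and mindspore.load_param_into_net(net, param_dict).

dummy_input = Tensor(np.zeros((1, 161), dtype=np.float32))
export(net, dummy_input, file_name='deepspeech2', file_format='MINDIR')  # writes deepspeech2.mindir
```
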
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters          | DeepSpeech                                                                                                  |
| ------------------- | ----------------------------------------------------------------------------------------------------------- |
| Resource            | NV SMX2 V100-32G                                                                                            |
| Uploaded Date       | 12/29/2020 (month/day/year)                                                                                 |
| MindSpore Version   | 1.0.0                                                                                                       |
| Dataset             | LibriSpeech                                                                                                 |
| Training Parameters | 2p, epoch=70, steps=5144 * epoch, batch_size=20, lr=3e-4                                                    |
| Optimizer           | Adam                                                                                                        |
| Loss Function       | CTCLoss                                                                                                     |
| Outputs             | probability                                                                                                 |
| Loss                | 0.2-0.7                                                                                                     |
| Speed               | 2p: 2.139 s/step                                                                                            |
| Total Training Time | 2p: around 1 week                                                                                           |
| Checkpoint          | 991M (.ckpt file)                                                                                           |
| Scripts             | [DeepSpeech script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/audio/deepspeech) |

### Inference Performance

| Parameters            | DeepSpeech                  |
| --------------------- | --------------------------- |
| Resource              | NV SMX2 V100-32G            |
| Uploaded Date         | 12/29/2020 (month/day/year) |
| MindSpore Version     | 1.0.0                       |
| Dataset               | LibriSpeech                 |
| batch_size            | 20                          |
| Outputs               | probability                 |
| Accuracy (test-clean) | WER: 9.732, CER: 3.270      |
| Accuracy (test-other) | WER: 28.198, CER: 12.253    |
| Model for inference   | 330M (.mindir file)         |

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).