# Contents

- [WaveNet Description](#wavenet-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
    - [Evaluation Process](#evaluation-process)
    - [Convert Process](#convert-process)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)
# [WaveNet Description](#contents)

WaveNet is a deep neural network for generating raw audio waveforms. The model is fully probabilistic and autoregressive, with the predictive distribution for each audio sample conditioned on all previous ones. We support training and evaluation on both GPU and CPU.

[Paper](https://arxiv.org/pdf/1609.03499.pdf): Oord A, Dieleman S, Zen H, et al. WaveNet: A generative model for raw audio.
# [Model Architecture](#contents)

The current model consists of a pre-convolution layer, followed by several residual blocks, each with residual and skip connections and gated activation units.
Finally, post-convolution layers are added to predict the distribution.
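To make the block structure concrete, here is a minimal NumPy sketch of one gated residual block. It is illustrative only, not the project's MindSpore implementation in `wavenet_vocoder/modules.py`, and the weight shapes and helper names are assumptions:

```python
# A minimal NumPy sketch of one WaveNet gated residual block, for intuition
# only; weight shapes and helper names here are illustrative assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dilated_causal_conv(x, w, dilation):
    """Kernel-size-2 dilated causal conv. x: (in_ch, T), w: (out_ch, in_ch, 2)."""
    x = np.pad(x, ((0, 0), (dilation, 0)))  # left-pad so no future samples leak in
    return w[:, :, 0] @ x[:, :-dilation] + w[:, :, 1] @ x[:, dilation:]

def gated_residual_block(x, w_filter, w_gate, w_res, w_skip, dilation):
    """z = tanh(W_f * x) * sigmoid(W_g * x); returns (residual_out, skip_out)."""
    z = np.tanh(dilated_causal_conv(x, w_filter, dilation)) \
        * sigmoid(dilated_causal_conv(x, w_gate, dilation))
    residual = w_res @ z + x  # 1x1 conv plus the residual connection
    skip = w_skip @ z         # 1x1 conv feeding the skip pathway
    return residual, skip
```

The skip outputs of all blocks are summed and passed to the post-convolution layers.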
# [Dataset](#contents)

In the following sections, we will introduce how to run the scripts using the related dataset below.

Dataset used: [The LJ Speech Dataset](<https://keithito.com/LJ-Speech-Dataset>)

- Dataset size: 2.6 GB
- Data format: audio clips (13,100) and transcription
- The dataset structure is as follows:
```path
.
└── LJSpeech-1.1
    ├─ wavs          // audio clip files
    └─ metadata.csv  // transcripts
```
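As a quick sanity check after downloading, the layout above can be verified with a few lines of Python. This is a hedged example; `metadata.csv` in the standard LJSpeech-1.1 release is pipe-delimited:

```python
# Verify the LJSpeech-1.1 layout sketched above; paths follow the standard release.
import csv
import os

root = "LJSpeech-1.1"  # adjust to where you extracted the dataset
with open(os.path.join(root, "metadata.csv"), encoding="utf-8") as f:
    # Each row is: <clip id>|<raw transcript>|<normalized transcript>
    rows = list(csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE))

print("clips listed:", len(rows))  # expected: 13100
first_wav = os.path.join(root, "wavs", rows[0][0] + ".wav")
print("first clip present:", os.path.isfile(first_wav))
```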
# [Environment Requirements](#contents)

- Hardware (GPU/CPU)
    - Prepare a hardware environment with a GPU/CPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)
# [Script Description](#contents)

## [Script and Sample Code](#contents)

**Note that some of the scripts described below are not included in our code.** They should first be downloaded from [r9y9](https://github.com/r9y9/wavenet_vocoder) and added to this project.
```path
.
├── audio
    └── wavenet
        ├── scripts
        │   ├── run_distribute_train_gpu.sh   // launch distributed training with gpu platform (8p)
        │   ├── run_eval_cpu.sh               // launch evaluation with cpu platform
        │   ├── run_eval_gpu.sh               // launch evaluation with gpu platform
        │   ├── run_standalone_train_cpu.sh   // launch standalone training with cpu platform
        │   └── run_standalone_train_gpu.sh   // launch standalone training with gpu platform (1p)
        ├── datasets                  // Note: this folder should be downloaded from the above link
        ├── egs                       // Note: this folder should be downloaded from the above link
        ├── utils                     // Note: this folder should be downloaded from the above link
        ├── audio.py                  // audio utils. Note: this script should be downloaded from the above link
        ├── compute-meanvar-stats.py  // compute mean-variance normalization stats. Note: this script should be downloaded from the above link
        ├── evaluate.py               // evaluation
        ├── export.py                 // convert a MindSpore model to an AIR model
        ├── hparams.py                // hyper-parameter configuration. Note: this script should be downloaded from the above link
        ├── mksubset.py               // make a subset of the dataset. Note: this script should be downloaded from the above link
        ├── preprocess.py             // preprocess the dataset. Note: this script should be downloaded from the above link
        ├── preprocess_normalize.py   // perform mean-variance normalization on preprocessed features. Note: this script should be downloaded from the above link
        ├── README.md                 // description of WaveNet
        ├── train.py                  // training script
        ├── train_pytorch.py          // Note: this script should be downloaded from the above link; its original name in that project is train.py
        ├── src
        │   ├── __init__.py
        │   ├── dataset.py        // generate the dataloader and data processing entry
        │   ├── callback.py       // callbacks to monitor the training
        │   ├── lr_generator.py   // learning rate generator
        │   └── loss.py           // loss function definition
        └── wavenet_vocoder
            ├── __init__.py
            ├── conv.py           // extended 1D convolution
            ├── mixture.py        // loss functions for training and sample functions for testing
            ├── modules.py        // modules for WaveNet construction
            ├── upsample.py       // upsample layer definition
            ├── util.py           // utils. Note: this script should be downloaded from the above link
            ├── wavenet.py        // WaveNet networks
            └── tfcompat          // Note: this folder should be downloaded from the above link
                ├── __init__.py
                └── hparam.py     // parameter management tools
```
## [Script Parameters](#contents)

### Training

```text
usage: train.py [--data_path DATA_PATH] [--preset PRESET]
                [--checkpoint_dir CHECKPOINT_DIR] [--checkpoint CHECKPOINT]
                [--speaker_id SPEAKER_ID] [--platform PLATFORM]
                [--is_distributed IS_DISTRIBUTED]

options:
    --data_path        dataset path
    --preset           path of preset parameters (json)
    --checkpoint_dir   directory for saving model checkpoints
    --checkpoint       pre-trained ckpt path, default is "./checkpoints"
    --speaker_id       specific speaker of data for multi-speaker datasets, not used currently
    --platform         specify the platform to be used, default is "GPU"
    --is_distributed   whether to use distributed training or not
```
### Evaluation

```text
usage: evaluate.py [--data_path DATA_PATH] [--preset PRESET]
                   [--pretrain_ckpt PRETRAIN_CKPT] [--is_numpy]
                   [--output_path OUTPUT_PATH] [--speaker_id SPEAKER_ID]
                   [--platform PLATFORM]

options:
    --data_path      dataset path
    --preset         path of preset parameters (json)
    --pretrain_ckpt  pre-trained ckpt path
    --is_numpy       whether to use numpy for inference or not
    --output_path    path to save synthesized audio
    --speaker_id     specific speaker of data for multi-speaker datasets, not used currently
    --platform       specify the platform to be used, default is "GPU"
```
More parameters for training and evaluation can be set in the file `hparams.py`.
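For intuition, the snippet below sketches how a preset JSON overrides the defaults in `hparams.py`. It is based on the r9y9 hparam tools bundled in `tfcompat/hparam.py`; the exact call sites in the training scripts may differ slightly:

```python
# Hedged sketch: values in the preset JSON override the defaults in hparams.py.
from hparams import hparams  # default hyper-parameters (downloaded script)

with open("egs/gaussian/conf/gaussian_wavenet.json") as f:
    hparams.parse_json(f.read())  # keys present in the JSON replace the defaults

# Any key set in both places now reflects the preset, not hparams.py.
print(hparams.batch_size)
```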
## [Training Process](#contents)

Before your first training run, the dependency scripts should be downloaded and placed in the correct directories as described in [Script and Sample Code](#script-and-sample-code).
After that, the raw data should be preprocessed using the scripts in `egs`. The structure of `egs` is as follows:
```path
.
├── egs
    ├── gaussian
    │   ├── conf
    │   │   ├── gaussian_wavenet.json
    │   │   └── gaussian_wavenet_demo.json
    │   └── run.sh
    ├── mol
    │   ├── conf
    │   │   ├── mol_wavenet.json
    │   │   └── mol_wavenet_demo.json
    │   └── run.sh
    ├── mulaw256
    │   ├── conf
    │   │   ├── mulaw_wavenet.json
    │   │   └── mulaw_wavenet_demo.json
    │   └── run.sh
    └── README.md
```
In this project, three different losses are implemented to train the network:

- mulaw256: categorical output distribution. The input is an 8-bit mu-law quantized waveform (a minimal sketch of mu-law companding follows this list).
- mol: discretized mixture of logistics loss. The input is 16-bit raw audio.
- gaussian: mixture of Gaussians loss. The input is 16-bit raw audio.
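For reference, mu-law companding maps a waveform in [-1, 1] onto 256 discrete classes. Below is a NumPy sketch of the standard formula; the project's own implementation ships with the downloaded audio utilities:

```python
# Standard 8-bit mu-law companding, sketched for reference only.
import numpy as np

def mulaw_encode(x, mu=255):
    """Map a waveform in [-1, 1] to integer classes in [0, 255]."""
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.floor((y + 1) / 2 * mu + 0.5).astype(np.int64)

def mulaw_decode(q, mu=255):
    """Approximately invert mulaw_encode back to [-1, 1]."""
    y = 2.0 * q / mu - 1.0
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu
```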
The three folders gaussian, mol, and mulaw256 are used to generate the corresponding training data. For example, to generate the training data for the
mixture of Gaussians loss, first modify line 28 of `run.sh`, changing `conf/gaussian_wavenet_demo.json` to
`conf/gaussian_wavenet.json`. We use the default parameters in `gaussian_wavenet.json`. With this setting, the data will be generated to suit the mixture of Gaussians loss, and
some parameters in `hparams.py` will be overridden by those in `gaussian_wavenet.json`. You can also define your own hyper-parameter json here. After the modification,
the following commands can be run for data generation. Note that if you want to change the values of some parameters, you may need to modify them in `gaussian_wavenet.json` instead of `hparams.py`, since `gaussian_wavenet.json` may override those in `hparams.py`.
```bash
bash run.sh --stage 0 --stop-stage 0 --db-root /path_to_dataset/LJSpeech-1.1/wavs
bash run.sh --stage 1 --stop-stage 1
```
After the processing, the gaussian directory will be as follows:
```path
.
├── gaussian
    ├── conf
    ├── data
    ├── exp
    └── dump
        └── lj
            └── logmelspectrogram
                ├── org
                └── norm
                    ├── train_no_dev
                    ├── dev
                    └── eval
```
The `train_no_dev` folder contains the final training data. The process is the same for mol and mulaw256. When the training data is prepared,
you can run the following commands to train the network:
```bash
# Standalone training
# GPU:
sh ./scripts/run_standalone_train_gpu.sh [CUDA_DEVICE_ID] [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_save_ckpt]

# CPU:
sh ./scripts/run_standalone_train_cpu.sh [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_save_ckpt]

# Distributed training (8p)
sh ./scripts/run_distribute_train_gpu.sh [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_save_ckpt]
```
## [Evaluation Process](#contents)

WaveNet involves an autoregressive process that currently cannot be run in Graph mode (that is, placing the autoregression inside `construct`). Therefore, we implement the process in an ordinary function. We provide two ways to realize it: using NumPy or using MindSpore ops, and one can set `is_numpy` to determine which mode is used. We recommend NumPy, since it is much faster than using MindSpore ops. This is because the autoregressive process only calls simple operations like MatMul and BiasAdd; unlike Graph mode, each step incurs some fixed per-operator cost, which leads to lower speed. For more information, please refer to
this [link](https://bbs.huaweicloud.com/forum/thread-94852-1-1.html).
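The shape of the loop is sketched below. This is an illustrative skeleton, not the actual sampling code in `wavenet_vocoder/mixture.py`; `step_fn` stands in for one network step:

```python
# Illustrative skeleton of the autoregressive loop; each step is tiny compute,
# so fixed per-operator launch cost dominates when run op-by-op.
import numpy as np

def autoregressive_generate(step_fn, num_samples, init_sample=0.0):
    """Generate samples one at a time; each output feeds the next step."""
    samples = np.zeros(num_samples)
    prev = init_sample
    for t in range(num_samples):
        prev = step_fn(prev, t)  # e.g. matmul + bias_add + sampling
        samples[t] = prev
    return samples
```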
```bash
# Evaluation
# GPU (using numpy):
sh ./scripts/run_eval_gpu.sh [CUDA_DEVICE_ID] [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_load_ckpt] is_numpy [path_to_save_audio]

# GPU (using mindspore):
sh ./scripts/run_eval_gpu.sh [CUDA_DEVICE_ID] [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_load_ckpt] [path_to_save_audio]

# CPU:
sh ./scripts/run_eval_cpu.sh [/path_to_egs/egs/gaussian/dump/lj/logmelspectrogram/norm/] [/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json] [path_to_load_ckpt] [is_numpy] [path_to_save_audio]
```
## [Convert Process](#contents)

```bash
# GPU:
python export.py --preset=/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json --checkpoint_dir=path_to_dump_hparams --pretrain_ckpt=path_to_load_ckpt

# CPU:
python export.py --preset=/path_to_egs/egs/gaussian/conf/gaussian_wavenet.json --checkpoint_dir=path_to_dump_hparams --pretrain_ckpt=path_to_load_ckpt --platform=CPU
```
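Internally, `export.py` relies on MindSpore's serialization utilities. A hedged sketch of the core call follows; the network builder and the dummy input shape are illustrative assumptions, not the script's exact code:

```python
# Hedged sketch of the core of export.py; build_wavenet_from_preset and the
# dummy input shape are hypothetical placeholders for illustration.
import numpy as np
from mindspore import Tensor
from mindspore.train.serialization import export, load_checkpoint, load_param_into_net

net = build_wavenet_from_preset("gaussian_wavenet.json")  # hypothetical helper
load_param_into_net(net, load_checkpoint("path_to_load_ckpt"))

dummy = Tensor(np.zeros((1, 80, 100), np.float32))  # e.g. a mel-spectrogram
export(net, dummy, file_name="wavenet", file_format="AIR")
```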
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance on GPU

| Parameters           | WaveNet                                                             |
| -------------------- | ------------------------------------------------------------------- |
| Resource             | NV SMX2 V100-32G                                                    |
| Uploaded Date        | 01/14/2021 (month/day/year)                                         |
| MindSpore Version    | 1.0.0                                                               |
| Dataset              | LJSpeech-1.1                                                        |
| Training Parameters  | 1p, epoch=600 (max), steps=1635 * epoch, batch_size=8, lr=1e-3      |
| Optimizer            | Adam                                                                |
| Loss Function        | SoftmaxCrossEntropyWithLogits/discretized_mix_logistic/mix_gaussian |
| Loss                 | around 2.0 (mulaw256) / around 4.5 (mol) / around -6.0 (gaussian)   |
| Speed                | 1p: 1.467 s/step                                                    |
| Total time: training | 1p (mol/gaussian): around 4 days; 2p (mulaw256): around 1 week      |
| Checkpoint           | 59.79M/54.87M/54.83M (.ckpt file)                                   |
| Scripts              | [WaveNet script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/audio/wavenet) |
### Inference Performance on GPU

Audio samples will be demonstrated online soon.

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).