# Contents

- [DeepSpeech2 Description](#deepspeech2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
- [Training and Eval Process](#training-process)
    - [Export MindIR](#convert-process)
    - [Convert](#convert)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)

# [DeepSpeech2 Description](#contents)

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a wide variety of speech, including noisy environments, accents and different languages. We support training and evaluation on CPU and GPU.

[Paper](https://arxiv.org/pdf/1512.02595v1.pdf): Amodei, Dario, et al. "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin."

# [Model Architecture](#contents)

The current reproduced model consists of the following (a rough sketch follows the list):

- two convolutional layers:
    - number of channels is 32, kernel size is [41, 11], stride is [2, 2]
    - number of channels is 32, kernel size is [41, 11], stride is [2, 1]
- five bidirectional LSTM layers (size is 1024)
- one projection layer (size is the number of characters plus 1 for the CTC blank symbol, i.e. 29)

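For orientation, below is a rough, self-contained MindSpore sketch of this topology. It is illustrative only: the actual network lives in `src/DeepSpeech.py` and additionally handles sequence lengths, masking and batch normalization; the padding values and the 161 spectrogram bins are assumptions derived from the 16 kHz sample rate and 0.02 s window defaults.

```python
import numpy as np
import mindspore as ms
import mindspore.nn as nn


class DeepSpeech2Sketch(nn.Cell):
    """Illustrative topology only; src/DeepSpeech.py is the real implementation."""

    def __init__(self, num_classes=29, hidden_size=1024, num_rnn_layers=5, freq_bins=161):
        super().__init__()
        # Two 2-D convolutions over (frequency, time) spectrogram features.
        self.conv = nn.SequentialCell([
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2),
                      pad_mode='pad', padding=(20, 20, 5, 5)),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(41, 11), stride=(2, 1),
                      pad_mode='pad', padding=(20, 20, 5, 5)),
            nn.ReLU(),
        ])
        # The frequency axis is halved by each stride-2 convolution.
        f = (freq_bins - 1) // 2 + 1
        f = (f - 1) // 2 + 1
        rnn_input_size = 32 * f
        self.hidden_size = hidden_size
        # Five bidirectional LSTM layers (hidden size 1024 in the paper setting).
        self.rnn = nn.LSTM(rnn_input_size, hidden_size, num_layers=num_rnn_layers,
                           batch_first=True, bidirectional=True)
        # Projection to characters plus the CTC blank (29 classes).
        self.fc = nn.Dense(2 * hidden_size, num_classes)

    def construct(self, spec, h0, c0):
        # spec: (batch, 1, freq_bins, time_steps)
        x = self.conv(spec)                               # (batch, 32, freq', time')
        b, c, f, t = x.shape
        x = x.reshape(b, c * f, t).transpose(0, 2, 1)     # (batch, time', features)
        x, _ = self.rnn(x, (h0, c0))                      # (batch, time', 2 * hidden)
        x = x.reshape(b * t, 2 * self.hidden_size)
        return self.fc(x).reshape(b, t, -1)               # per-frame class scores


# Quick shape check with reduced sizes (not the training configuration).
net = DeepSpeech2Sketch(hidden_size=64, num_rnn_layers=2)
spec = ms.Tensor(np.zeros((2, 1, 161, 50), np.float32))
h0 = ms.Tensor(np.zeros((2 * 2, 2, 64), np.float32))      # (layers * directions, batch, hidden)
c0 = ms.Tensor(np.zeros((2 * 2, 2, 64), np.float32))
print(net(spec, h0, c0).shape)                             # (2, time', 29)
```
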
# [Dataset](#contents)

Note that you can run the scripts with the dataset mentioned in the original paper or one widely used in the relevant domain/network architecture. In the following sections, we introduce how to run the scripts using the dataset below.

Dataset used: [LibriSpeech](<http://www.openslr.org/12>)

- Train Data:
    - train-clean-100.tar.gz [6.3G] (training set of 100 hours of "clean" speech)
    - train-clean-360.tar.gz [23G] (training set of 360 hours of "clean" speech)
    - train-other-500.tar.gz [30G] (training set of 500 hours of "other" speech)
- Val Data:
    - dev-clean.tar.gz [337M] (development set, "clean" speech)
    - dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
- Test Data:
    - test-clean.tar.gz [346M] (test set, "clean" speech)
    - test-other.tar.gz [328M] (test set, "other" speech)
- Data format: wav and txt files
    - Note: data will be processed in librispeech.py

# [Environment Requirements](#contents)

- Hardware (GPU)
    - Prepare a hardware environment with a GPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```path
.
├── audio
    ├── deepspeech2
        ├── train.py                    // training script
        ├── eval.py                     // testing and evaluation outputs
        ├── export.py                   // convert mindspore model to mindir model
        ├── labels.json                 // possible characters to map to
        ├── README.md                   // descriptions about DeepSpeech
        ├── deepspeech_pytorch
        │   └── decoder.py              // decoder from third-party code (MIT License)
        └── src
            ├── __init__.py
            ├── DeepSpeech.py           // DeepSpeech networks
            ├── dataset.py              // generate dataloader and data processing entry
            ├── config.py               // DeepSpeech configs
            ├── lr_generator.py         // learning rate generator
            ├── greedydecoder.py        // modified greedy decoder for MindSpore code
            └── callback.py             // callbacks to monitor the training
```
## [Script Parameters](#contents)

### Training

```text
usage: train.py [--use_pretrained USE_PRETRAINED]
                [--pre_trained_model_path PRE_TRAINED_MODEL_PATH]
                [--is_distributed IS_DISTRIBUTED]
                [--bidirectional BIDIRECTIONAL]
                [--device_target DEVICE_TARGET]

options:
    --pre_trained_model_path    pretrained checkpoint path, default is ''
    --is_distributed            distributed training, default is False
    --bidirectional             whether or not to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --device_target             device where the code will be run: "GPU" | "CPU", default is "GPU"
```
### Evaluation

```text
usage: eval.py [--bidirectional BIDIRECTIONAL]
               [--pretrain_ckpt PRETRAIN_CKPT]
               [--device_target DEVICE_TARGET]

options:
    --bidirectional     whether to use bidirectional RNN, default is True. Currently, only the bidirectional model is implemented
    --pretrain_ckpt     saved checkpoint path, default is ''
    --device_target     device where the code will be run: "GPU" | "CPU", default is "GPU"
```
### Options and Parameters

Parameters for training and evaluation can be set in file `config.py`.

```text
config for training.
    epochs                  number of training epochs, default is 70
```
```text
config for dataloader.
    train_manifest          train manifest path, default is 'data/libri_train_manifest.csv'
    val_manifest            dev manifest path, default is 'data/libri_val_manifest.csv'
    batch_size              batch size for training, default is 8
    labels_path             tokens json path for model output, default is "./labels.json"
    sample_rate             sample rate for the data/model features, default is 16000
    window_size             window size for spectrogram generation (seconds), default is 0.02
    window_stride           window stride for spectrogram generation (seconds), default is 0.01
    window                  window type for spectrogram generation, default is 'hamming'
    speed_volume_perturb    use random tempo and gain perturbations, default is False, not used in current model
    spec_augment            use simple spectral augmentation on mel spectrograms, default is False, not used in current model
    noise_dir               directory to inject noise into audio. If default, noise injection is not added, default is '', not used in current model
    noise_prob              probability of noise being added per sample, default is 0.4, not used in current model
    noise_min               minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
    noise_max               maximum noise level to sample from, maximum 1.0, default is 0.5, not used in current model
```
```text
config for model.
    rnn_type            type of RNN to use in model, default is 'LSTM'. Currently, only LSTM is supported
    hidden_size         hidden size of RNN layer, default is 1024
    hidden_layers       number of RNN layers, default is 5
    lookahead_context   look ahead context, default is 20, not used in current model
```
```text
config for optimizer.
    learning_rate       initial learning rate, default is 3e-4
    learning_anneal     annealing applied to learning rate after each epoch, default is 1.1
    weight_decay        weight decay, default is 1e-5
    momentum            momentum, default is 0.9
    eps                 Adam eps, default is 1e-8
    betas               Adam betas, default is (0.9, 0.999)
    loss_scale          loss scale, default is 1024
```
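The `learning_rate` and `learning_anneal` settings imply an exponentially decaying schedule: following the common DeepSpeech2 convention, the learning rate is divided by `learning_anneal` after every epoch. A rough sketch of how such a per-step schedule can be generated (illustrative only; `src/lr_generator.py` is the authoritative implementation):

```python
def get_annealed_lr(base_lr, learning_anneal, total_epochs, steps_per_epoch):
    """Return one learning-rate value per training step, annealed once per epoch."""
    lr_each_step = []
    for epoch in range(total_epochs):
        lr = base_lr / (learning_anneal ** epoch)
        lr_each_step.extend([lr] * steps_per_epoch)
    return lr_each_step

# Defaults from the optimizer config above: 3e-4 initial LR, anneal factor 1.1.
schedule = get_annealed_lr(3e-4, 1.1, total_epochs=70, steps_per_epoch=5144)
print(schedule[0], schedule[-1])   # ~3.0e-4 at the start, ~4.2e-7 in the final epoch
```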
```text
config for checkpoint.
    ckpt_file_name_prefix   prefix of the checkpoint file name, default is 'DeepSpeech'
    ckpt_path               path to save checkpoints, default is 'checkpoints'
    keep_checkpoint_max     max number of checkpoints to keep (older checkpoints are deleted), default is 10
```
# [Training and Eval Process](#contents)

Before training, the dataset should be processed. We use the data-processing scripts provided by [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch), which automatically download the dataset and process it. After processing, the dataset directory structure is as follows:

```path
.
├── LibriSpeech_dataset
│   ├── train
│   │   ├── wav
│   │   └── txt
│   ├── val
│   │   ├── wav
│   │   └── txt
│   ├── test_clean
│   │   ├── wav
│   │   └── txt
│   └── test_other
│       ├── wav
│       └── txt
└── libri_test_clean_manifest.csv, libri_test_other_manifest.csv, libri_train_manifest.csv, libri_val_manifest.csv
```
The *.csv files store the absolute paths of the corresponding data. After obtaining the csv files, you should modify the configurations in `src/config.py`. For the training config, `train_manifest` should be set to the path of `libri_train_manifest.csv`; for the eval config, it should be set to `libri_test_clean_manifest.csv` or `libri_test_other_manifest.csv`, depending on which dataset is evaluated.

```text
...
# training configuration
"DataConfig": {
    train_manifest: 'path_to_csv/libri_train_manifest.csv'
}

# evaluation configuration
"DataConfig": {
    train_manifest: 'path_to_csv/libri_test_clean_manifest.csv'
}
```
Before training, some requirements should be installed, including `librosa` and `Levenshtein`.
After installing MindSpore via the official website and finishing dataset processing, you can start training as follows:

```shell
# standalone training
CUDA_VISIBLE_DEVICES='0' python train.py

# distributed training
CUDA_VISIBLE_DEVICES='0,1,2,3,4,5,6,7' mpirun --allow-run-as-root -n 8 python train.py --is_distributed > log 2>&1 &
```
The following script is used to evaluate the model. Note that only the greedy decoder is supported at the moment (a short sketch of the idea follows the command below). Before running the script, you should download the decoder code from [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) and place the `deepspeech_pytorch` directory inside the `deepspeech2` directory. After that, the directory layout matches the one shown in [Script and Sample Code](#script-and-sample-code).

```shell
# eval
CUDA_VISIBLE_DEVICES='0' python eval.py --pretrain_ckpt='saved_model_path'
```
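For reference, greedy (best-path) CTC decoding takes the per-frame argmax over the output distribution, collapses repeated symbols, and removes the CTC blank. A minimal, self-contained sketch of the idea (illustrative only; `eval.py` uses the repo's `greedydecoder.py`, adapted from the SeanNaren decoder, and the blank's position depends on `labels.json`):

```python
import numpy as np

def greedy_ctc_decode(scores, labels, blank_index=0):
    """Best-path CTC decoding.

    scores: (time_steps, num_classes) array of per-frame class scores.
    labels: list of output characters, e.g. loaded from labels.json.
    blank_index: index of the CTC blank symbol in `labels`.
    """
    best_path = np.argmax(scores, axis=1)
    decoded = []
    prev = blank_index
    for idx in best_path:
        # Collapse repeats, then drop blanks.
        if idx != prev and idx != blank_index:
            decoded.append(labels[idx])
        prev = idx
    return "".join(decoded)

# Toy example with a 3-symbol alphabet ('a', 'b', blank at index 2).
scores = np.array([[0.9, 0.0, 0.1],   # 'a'
                   [0.8, 0.1, 0.1],   # 'a' (repeat, collapsed)
                   [0.1, 0.1, 0.8],   # blank
                   [0.1, 0.8, 0.1]])  # 'b'
print(greedy_ctc_decode(scores, ["a", "b", "_"], blank_index=2))  # -> "ab"
```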
## [Export MindIR](#contents)

```bash
python export.py --pre_trained_model_path='ckpt_path'
```
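`export.py` loads the trained checkpoint into the network and exports a MindIR graph. A minimal sketch of the typical MindSpore export flow (the stand-in network, dummy input shape and file name are illustrative; see the repo's `export.py` for the real logic):

```python
import numpy as np
import mindspore.nn as nn
from mindspore import Tensor, export, load_checkpoint, load_param_into_net

# Stand-in network; in the real export.py this is the DeepSpeech2 model from src/DeepSpeech.py.
net = nn.Dense(161, 29)

# Load trained weights into the network (uncomment with a real checkpoint path).
# param_dict = load_checkpoint("ckpt_path")
# load_param_into_net(net, param_dict)

# A dummy input with the expected shape drives graph tracing during export.
dummy_input = Tensor(np.zeros((1, 161), np.float32))
export(net, dummy_input, file_name="deepspeech2", file_format="MINDIR")
```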
# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | DeepSpeech                                                      |
| -------------------------- | --------------------------------------------------------------- |
| Resource                   | NV SMX2 V100-32G                                                |
| Uploaded Date              | 12/29/2020 (month/day/year)                                     |
| MindSpore Version          | 1.0.0                                                           |
| Dataset                    | LibriSpeech                                                     |
| Training Parameters        | 2p, epoch=70, steps=5144 * epoch, batch_size = 20, lr=3e-4      |
| Optimizer                  | Adam                                                            |
| Loss Function              | CTCLoss                                                         |
| outputs                    | probability                                                     |
| Loss                       | 0.2-0.7                                                         |
| Speed                      | 2p: 2.139 s/step                                                |
| Total time                 | 2p: around 1 week                                               |
| Checkpoint                 | 991M (.ckpt file)                                               |
| Scripts                    | [DeepSpeech script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/audio/deepspeech2) |

### Inference Performance

| Parameters             | DeepSpeech                                               |
| ---------------------- | -------------------------------------------------------- |
| Resource               | NV SMX2 V100-32G                                         |
| Uploaded Date          | 12/29/2020 (month/day/year)                              |
| MindSpore Version      | 1.0.0                                                    |
| Dataset                | LibriSpeech                                              |
| batch_size             | 20                                                       |
| outputs                | probability                                              |
| Accuracy (test-clean)  | 2p: WER 9.902, CER 3.317; 8p: WER 11.593, CER 3.907      |
| Accuracy (test-other)  | 2p: WER 28.693, CER 12.473; 8p: WER 31.397, CER 13.696   |
| Model for inference    | 330M (.mindir file)                                      |

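WER and CER in the table above are edit-distance rates: the Levenshtein distance between hypothesis and reference at the word and character level respectively, divided by the reference length (reported as percentages). A minimal sketch of the computation (illustrative only; the repo's `eval.py` relies on the `Levenshtein` package mentioned earlier and may normalize text differently):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)

print(wer("the cat sat", "the cat sit"))   # 1 substituted word / 3 words ≈ 0.333
print(cer("the cat sat", "the cat sit"))   # 1 substituted char / 11 chars ≈ 0.091
```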
# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).