# Contents

- [DeepSpeech2 Description](#deepspeech2-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
- [Training and Eval Process](#training-and-eval-process)
    - [Export MindIR](#export-mindir)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
        - [Inference Performance](#inference-performance)
- [ModelZoo Homepage](#modelzoo-homepage)

# [DeepSpeech2 Description](#contents)

DeepSpeech2 is a speech recognition model trained with CTC loss. It replaces entire pipelines of hand-engineered components with neural networks and can handle a wide variety of speech, including noisy environments, accents and different languages. We support training and evaluation on CPU and GPU.

[Paper](https://arxiv.org/pdf/1512.02595v1.pdf): Amodei, Dario, et al. Deep Speech 2: End-to-end speech recognition in English and Mandarin.

# [Model Architecture](#contents)

The current reproduced model consists of (see the sketch after this list):

- two convolutional layers:
    - number of channels is 32, kernel size is [41, 11], stride is [2, 2]
    - number of channels is 32, kernel size is [41, 11], stride is [2, 1]
- five bidirectional LSTM layers (size is 1024)
- one projection layer (size is the number of characters plus 1 for the CTC blank symbol, i.e. 29)
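For orientation, the stack above can be sketched in MindSpore roughly as follows. This is an illustrative sketch only, not the implementation in `src/DeepSpeech.py`: the class name, the `rnn_input_size` argument (which depends on how many spectrogram frequency bins remain after the convolutions), and the batch-norm/ReLU choices are assumptions.

```python
# Illustrative sketch of the layer stack described above; NOT the code in
# src/DeepSpeech.py. Names, normalization/activation and shape handling are assumptions.
import mindspore.nn as nn


class DeepSpeech2Sketch(nn.Cell):
    def __init__(self, rnn_input_size, hidden_size=1024, num_rnn_layers=5, num_classes=29):
        super().__init__()
        # two 2-D convolutions over the (frequency, time) spectrogram
        self.conv = nn.SequentialCell([
            nn.Conv2d(1, 32, kernel_size=(41, 11), stride=(2, 2), pad_mode='same'),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=(41, 11), stride=(2, 1), pad_mode='same'),
            nn.BatchNorm2d(32),
            nn.ReLU(),
        ])
        # five bidirectional LSTM layers with hidden size 1024
        self.rnn = nn.LSTM(input_size=rnn_input_size, hidden_size=hidden_size,
                           num_layers=num_rnn_layers, batch_first=True,
                           bidirectional=True)
        # projection to 28 characters + 1 CTC blank = 29 outputs
        self.fc = nn.Dense(hidden_size * 2, num_classes)

    def construct(self, x):
        # x: (batch, 1, freq, time) log-spectrogram
        x = self.conv(x)                                  # (batch, 32, freq', time')
        n, c, f, t = x.shape
        x = x.transpose(0, 3, 1, 2).reshape(n, t, c * f)  # (batch, time', 32 * freq')
        x, _ = self.rnn(x)                                # (batch, time', 2 * hidden)
        n2, t2, h = x.shape
        x = self.fc(x.reshape(n2 * t2, h))                # project each frame
        return x.reshape(n2, t2, -1)                      # per-frame class scores
```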
# [Dataset](#contents)

Note that you can run the scripts with the dataset mentioned in the original paper or one widely used in this domain/network architecture. In the following sections, we will introduce how to run the scripts using the related dataset below.

Dataset used: [LibriSpeech](http://www.openslr.org/12)

- Train Data:
    - train-clean-100.tar.gz [6.3G] (training set of 100 hours of "clean" speech)
    - train-clean-360.tar.gz [23G] (training set of 360 hours of "clean" speech)
    - train-other-500.tar.gz [30G] (training set of 500 hours of "other" speech)
- Val Data:
    - dev-clean.tar.gz [337M] (development set, "clean" speech)
    - dev-other.tar.gz [314M] (development set, "other", more challenging, speech)
- Test Data:
    - test-clean.tar.gz [346M] (test set, "clean" speech)
    - test-other.tar.gz [328M] (test set, "other" speech)
- Data format: wav and txt files
    - Note: data will be processed by librispeech.py

# [Environment Requirements](#contents)

- Hardware (GPU/CPU)
    - Prepare hardware environment with a GPU or CPU processor.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
    - [MindSpore tutorials](https://www.mindspore.cn/tutorial/training/en/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/doc/api_python/en/master/index.html)

# [Script Description](#contents)

## [Script and Sample Code](#contents)

```path
.
├── audio
    ├── deepspeech2
        ├── scripts
        │   ├── run_distribute_train_gpu.sh    // launch distributed training with gpu platform(8p)
        │   ├── run_eval_cpu.sh                // launch evaluation with cpu platform
        │   ├── run_eval_gpu.sh                // launch evaluation with gpu platform
        │   ├── run_standalone_train_cpu.sh    // launch standalone training with cpu platform
        │   └── run_standalone_train_gpu.sh    // launch standalone training with gpu platform(1p)
        ├── train.py                           // training script
        ├── eval.py                            // evaluation script
        ├── export.py                          // convert mindspore model to mindir model
        ├── labels.json                        // possible characters to map to
        ├── README.md                          // descriptions about DeepSpeech2
        ├── deepspeech_pytorch
        │   └── decoder.py                     // decoder from third-party code (MIT License)
        └── src
            ├── __init__.py
            ├── DeepSpeech.py                  // DeepSpeech2 network
            ├── dataset.py                     // generate dataloader and data processing entry
            ├── config.py                      // DeepSpeech2 configs
            ├── lr_generator.py                // learning rate generator
            ├── greedydecoder.py               // modified greedy decoder for mindspore code
            └── callback.py                    // callbacks to monitor the training
```

## [Script Parameters](#contents)

### Training

```text
usage: train.py  [--use_pretrained USE_PRETRAINED]
                 [--pre_trained_model_path PRE_TRAINED_MODEL_PATH]
                 [--is_distributed IS_DISTRIBUTED]
                 [--bidirectional BIDIRECTIONAL]
                 [--device_target DEVICE_TARGET]

options:
    --pre_trained_model_path    pretrained checkpoint path, default is ''
    --is_distributed            distributed training, default is False
    --bidirectional             whether or not to use bidirectional RNN, default is True.
                                Currently, only the bidirectional model is implemented
    --device_target             device where the code will be run: "GPU" | "CPU", default is "GPU"
```

### Evaluation

```text
usage: eval.py  [--bidirectional BIDIRECTIONAL]
                [--pretrain_ckpt PRETRAIN_CKPT]
                [--device_target DEVICE_TARGET]

options:
    --bidirectional             whether to use bidirectional RNN, default is True.
                                Currently, only the bidirectional model is implemented
    --pretrain_ckpt             saved checkpoint path, default is ''
    --device_target             device where the code will be run: "GPU" | "CPU", default is "GPU"
```

### Options and Parameters

Parameters for training and evaluation can be set in file `config.py`.

```text
config for training.
    epochs                      number of training epochs, default is 70
```

```text
config for dataloader.
    train_manifest              train manifest path, default is 'data/libri_train_manifest.csv'
    val_manifest                dev manifest path, default is 'data/libri_val_manifest.csv'
    batch_size                  batch size for training, default is 8
    labels_path                 tokens json path for model output, default is "./labels.json"
    sample_rate                 sample rate for the data/model features, default is 16000
    window_size                 window size for spectrogram generation (seconds), default is 0.02
    window_stride               window stride for spectrogram generation (seconds), default is 0.01
    window                      window type for spectrogram generation, default is 'hamming'
    speed_volume_perturb        use random tempo and gain perturbations, default is False, not used in current model
    spec_augment                use simple spectral augmentation on mel spectrograms, default is False, not used in current model
    noise_dir                   directory to inject noise into audio. If default '', no noise is injected, not used in current model
    noise_prob                  probability of noise being added per sample, default is 0.4, not used in current model
    noise_min                   minimum noise level to sample from (1.0 means all noise, no original signal), default is 0.0, not used in current model
    noise_max                   maximum noise level to sample from. Maximum is 1.0, default is 0.5, not used in current model
```
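As a rough illustration of how the spectrogram settings in the dataloader config above fit together, the snippet below converts the window parameters into STFT settings. This is a standalone sketch, not the code in `src/dataset.py`; it assumes `librosa` is installed and `example.wav` is a placeholder path.

```python
# Standalone illustration of the DataConfig spectrogram settings above;
# not the repository's dataset.py. "example.wav" is a placeholder path.
import numpy as np
import librosa

sample_rate = 16000      # DataConfig.sample_rate
window_size = 0.02       # seconds -> 16000 * 0.02 = 320-sample window
window_stride = 0.01     # seconds -> 16000 * 0.01 = 160-sample hop

n_fft = int(sample_rate * window_size)         # 320
hop_length = int(sample_rate * window_stride)  # 160

audio, _ = librosa.load("example.wav", sr=sample_rate)
spec = np.abs(librosa.stft(audio, n_fft=n_fft, hop_length=hop_length,
                           win_length=n_fft, window="hamming"))
log_spec = np.log1p(spec)  # shape: (n_fft // 2 + 1, frames) = (161, frames)
```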
```text
config for model.
    rnn_type                    type of RNN to use in model, default is 'LSTM'. Currently, only LSTM is supported
    hidden_size                 hidden size of RNN layer, default is 1024
    hidden_layers               number of RNN layers, default is 5
    lookahead_context           lookahead context, default is 20, not used in current model
```

```text
config for optimizer.
    learning_rate               initial learning rate, default is 3e-4
    learning_anneal             annealing applied to learning rate after each epoch, default is 1.1
    weight_decay                weight decay, default is 1e-5
    momentum                    momentum, default is 0.9
    eps                         Adam eps, default is 1e-8
    betas                       Adam betas, default is (0.9, 0.999)
    loss_scale                  loss scale, default is 1024
```

```text
config for checkpoint.
    ckpt_file_name_prefix       ckpt file name prefix, default is 'DeepSpeech'
    ckpt_path                   path to save ckpt, default is 'checkpoints'
    keep_checkpoint_max         max number of checkpoints to keep; older checkpoints are deleted, default is 10
```

# [Training and Eval Process](#contents)

Before training, the dataset should be processed. We use the scripts provided by [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) to process the dataset; they automatically download the dataset and process it. After processing, the dataset directory structure is as follows:

```path
.
├─ LibriSpeech_dataset
│  ├── train
│  │   ├─ wav
│  │   └─ txt
│  ├── val
│  │   ├─ wav
│  │   └─ txt
│  ├── test_clean
│  │   ├─ wav
│  │   └─ txt
│  └── test_other
│      ├─ wav
│      └─ txt
└─ libri_test_clean_manifest.csv, libri_test_other_manifest.csv, libri_train_manifest.csv, libri_val_manifest.csv
```

The *.csv files store the absolute paths of the corresponding data. After obtaining the csv files, you should modify the configurations in `src/config.py`. For the training config, `train_manifest` should be set to the path of `libri_train_manifest.csv`; for the eval config, it should be set to `libri_test_clean_manifest.csv` or `libri_test_other_manifest.csv`, depending on which dataset is evaluated.

```shell
...
for training configuration
"DataConfig":{
    train_manifest:'path_to_csv/libri_train_manifest.csv'
}

for evaluation configuration
"DataConfig":{
    train_manifest:'path_to_csv/libri_test_clean_manifest.csv'
}
```

Before training, some requirements should be installed, including `librosa` and `Levenshtein`. After installing MindSpore via the official website and finishing dataset processing, you can start training as follows:

```shell
# standalone training gpu
sh ./scripts/run_standalone_train_gpu.sh [DEVICE_ID]

# standalone training cpu
sh ./scripts/run_standalone_train_cpu.sh

# distributed training gpu
sh ./scripts/run_distribute_train_gpu.sh
```

Note that only the greedy decoder is supported now. Before running the evaluation script, you should download the decoder code from [SeanNaren](https://github.com/SeanNaren/deepspeech.pytorch) and place the `deepspeech_pytorch` directory into the `deepspeech2` directory; after that, the file directory will match the one shown in [Script and Sample Code](#script-and-sample-code). Greedy decoding works roughly as in the sketch below.
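This is an illustrative sketch of greedy CTC decoding, not the repository's `src/greedydecoder.py`; the `blank_index` value is an assumption that depends on where the blank symbol sits in `labels.json`.

```python
# Illustrative greedy CTC decoding: take the argmax per frame, collapse
# repeated symbols, then drop the blank. Not the repository's greedydecoder.py;
# blank_index is an assumption that depends on labels.json.
import numpy as np

def greedy_ctc_decode(frame_probs, labels, blank_index=0):
    """frame_probs: (time, num_classes) per-frame output probabilities."""
    best_path = np.argmax(frame_probs, axis=1)
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(labels[index])
        previous = index
    return "".join(decoded)
```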
The following script is used to evaluate the model:

```shell
# eval on cpu
sh ./scripts/run_eval_cpu.sh [PATH_CHECKPOINT]

# eval on gpu
sh ./scripts/run_eval_gpu.sh [DEVICE_ID] [PATH_CHECKPOINT]
```

## [Export MindIR](#contents)

```bash
python export.py --pre_trained_model_path='ckpt_path'
```

# [Model Description](#contents)

## [Performance](#contents)

### Training Performance

| Parameters                 | DeepSpeech                                                      |
| -------------------------- | --------------------------------------------------------------- |
| Resource                   | NV SMX2 V100-32G                                                |
| uploaded Date              | 12/29/2020 (month/day/year)                                     |
| MindSpore Version          | 1.0.0                                                           |
| Dataset                    | LibriSpeech                                                     |
| Training Parameters        | 2p, epoch=70, steps=5144 * epoch, batch_size = 20, lr=3e-4      |
| Optimizer                  | Adam                                                            |
| Loss Function              | CTCLoss                                                         |
| outputs                    | probability                                                     |
| Loss                       | 0.2-0.7                                                         |
| Speed                      | 2p 2.139s/step                                                  |
| Total time: training       | 2p: around 1 week                                               |
| Checkpoint                 | 991M (.ckpt file)                                               |
| Scripts                    | [DeepSpeech script](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/audio/deepspeech2) |

### Inference Performance

| Parameters                 | DeepSpeech                                                       |
| -------------------------- | ---------------------------------------------------------------- |
| Resource                   | NV SMX2 V100-32G                                                 |
| uploaded Date              | 12/29/2020 (month/day/year)                                      |
| MindSpore Version          | 1.0.0                                                            |
| Dataset                    | LibriSpeech                                                      |
| batch_size                 | 20                                                               |
| outputs                    | probability                                                      |
| Accuracy(test-clean)       | 2p: WER: 9.902  CER: 3.317  8p: WER: 11.593  CER: 3.907          |
| Accuracy(test-other)       | 2p: WER: 28.693  CER: 12.473  8p: WER: 31.397  CER: 13.696       |
| Model for inference        | 330M (.mindir file)                                              |

# [ModelZoo Homepage](#contents)

Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).