TensorflowASR
An end-to-end speech recognition model based on Conformer, implemented with TensorFlow 2. The real-time factor (RTF) on a CPU is around 0.1.
The current branch is the V2 version, which uses a CTC + translation structure.
You are welcome to use it and report bugs.
For the old version, please see the V1 branch.
Aishell-1 training results:
Offline results
| Name | Params | Chinese CER | Epochs | Online/Offline | Test data | Decoding method |
|---|---|---|---|---|---|---|
| Wenet(Conformer) | 9.5M | 6.48% | 100 | Offline | aishell1-test | ctc_greedy |
| Wenet(Transformer) | 9.7M | 8.68% | 100 | Offline | aishell1-test | ctc_greedy |
| Wenet(Paraformer) | 9.0M | 6.99% | 100 | Offline | aishell1-test | paraformer_greedy |
| FunASR(Paraformer) | 9.5M | 6.37% | 100 | Offline | aishell1-test | paraformer_greedy |
| FunASR(Conformer) | 9.5M | 6.64% | 100 | Offline | aishell1-test | ctc_greedy |
| FunASR(e_branchformer) | 10.1M | 6.65% | 100 | Offline | aishell1-test | ctc_greedy |
| repo(ConformerCTC) | 10.1M | 6.8% | 100 | Offline | aishell1-test | ctc_greedy |
Streaming results
| Name | Params | Chinese CER | Epochs | Online/Offline | Test data | Decoding method |
|---|---|---|---|---|---|---|
| Wenet(U2++ Conformer) | 10.6M | 8.18% | 100 | Online | aishell1-test | ctc_greedy |
| Wenet(U2++ Transformer) | 10.3M | 9.88% | 100 | Online | aishell1-test | ctc_greedy |
| repo(StreamingConformerCTC) | 10.1M | 7.2% | 100 | Online | aishell1-test | ctc_greedy |
| repo(ChunkConformer) | 10.7M | 8.9% | 100 | Online | aishell1-test | ctc_greedy |
TTS: https://github.com/Z-yq/TensorflowTTS
NLU: -
BOT: -
Even without your own data, you can still reach a reasonable level of ASR performance by using TTS-generated speech.
TTS for ASR: the TTS model was trained on aishell1 and aishell3, so the generated data is well suited for ASR training.
Tips:
- There are 500 voices in total.
- Only Chinese is supported.
- If the text to be synthesized contains punctuation marks, please remove them manually.
- If you want to add a pause, insert sil in the middle of the text.
Step 1: Prepare a list of the texts to be synthesized, e.g. a file named text.list:
这是第一句话
这是第二句话
这是一句sil有停顿的话
...
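A minimal sketch of building such a list programmatically (the sentences below are just the placeholders from the example above):

```python
# Write the sentences to be synthesized into text.list, one per line.
sentences = [
    "这是第一句话",
    "这是第二句话",
    "这是一句sil有停顿的话",  # "sil" inserts a pause
]
with open("text.list", "w", encoding="utf-8") as f:
    for s in sentences:
        f.write(s + "\n")
```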
Step 2: Download the models.
Link: https://pan.baidu.com/s/1deN1PmJ4olkRKw8ceQrUNA Extraction code: c0tp
Both models need to be downloaded and placed into the directory ./augmentations/tts_for_asr/models
Step 3: Run the script from the root directory:
python ./augmentations/tts_for_asr/tts_augment.py -f text.list -o save_dir --voice_num 10 --vc_num 3
where:
- -f: the text list prepared in step 1
- -o: the path where the synthesized corpus is saved (an absolute path is recommended)
- --voice_num: how many voices are used to synthesize each sentence
- --vc_num: how many voice-conversion augmentations are generated per sentence
After the run completes, a wavs directory and an utterance.txt will be generated under the -o path.
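If you want to feed the synthesized corpus into ASR training, a sketch like the one below can turn it into a train list. The layout of utterance.txt is assumed here to be "wav name + tab + text" per line; verify it against your generated file before using this.

```python
import os

save_dir = "save_dir"  # the path passed to -o above
pairs = []
# Assumption: each line of utterance.txt is "<wav name>\t<text>"; adjust if your file differs.
with open(os.path.join(save_dir, "utterance.txt"), encoding="utf-8") as f:
    for line in f:
        name, text = line.strip().split("\t", 1)
        pairs.append(os.path.join(save_dir, "wavs", name) + "\t" + text)

with open("tts_train.list", "w", encoding="utf-8") as f:
    f.write("\n".join(pairs) + "\n")
```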
The speech spectrogram feature-extraction layer is implemented in TF2, following the librosa library.
Alternatively, you can use Leaf, which has fewer parameters.
Usage:
mel_layer_type: Melspectrogram # alternatives: Spectrogram / leaf
trainable_kernel: True # supports training the kernel, not recommended
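Conceptually, the Melspectrogram layer computes a log-mel spectrogram in the same spirit as librosa. A minimal TF2 sketch of that computation (the parameter values here are illustrative, not necessarily the repo's defaults):

```python
import tensorflow as tf

def log_mel_spectrogram(signal, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
    """Log-mel features via tf.signal, similar in spirit to librosa.feature.melspectrogram."""
    stft = tf.signal.stft(signal, frame_length=n_fft, frame_step=hop, fft_length=n_fft)
    power = tf.abs(stft) ** 2
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=n_fft // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=0.0,
        upper_edge_hertz=sample_rate / 2,
    )
    mel = tf.matmul(power, mel_matrix)
    return tf.math.log(mel + 1e-6)

# Example: one second of audio at 16 kHz -> (frames, n_mels)
features = log_mel_spectrogram(tf.zeros([16000]))
```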
The C++ project based on ONNX has been updated.
See CppInference ONNX for details.
A Python inference scheme based on ONNX is also available; see pythonInference for details.
Streaming Conformer structures are now supported.
There are currently two implementations (a toy streaming-decode sketch follows the list):
- Block Conformer + Global CTC
- Chunk Conformer + CTC Picker
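Whichever variant is used, the streaming idea is to push fixed-size chunks of features through the encoder and greedily decode the CTC output as it arrives. A toy, self-contained sketch of that loop (the encoder below is a random stand-in, not the repo's model, and the blank index is an assumption):

```python
import numpy as np

BLANK = 0  # assumed CTC blank index

def fake_chunk_encoder(chunk):
    """Stand-in for the streaming encoder: returns per-frame token logits."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(chunk.shape[0] // 4, 10))  # pretend 4x downsampling, 10 tokens

def ctc_greedy_merge(prev_token, logits):
    """Greedy CTC over one chunk: argmax per frame, collapse repeats, drop blanks."""
    tokens = []
    for frame in logits:
        tok = int(np.argmax(frame))
        if tok != BLANK and tok != prev_token:
            tokens.append(tok)
        prev_token = tok
    return tokens, prev_token

audio_features = np.zeros((640, 80))  # placeholder feature stream
decoded, last = [], BLANK
for start in range(0, len(audio_features), 64):  # 64-frame chunks
    logits = fake_chunk_encoder(audio_features[start:start + 64])
    new_tokens, last = ctc_greedy_merge(last, logits)
    decoded.extend(new_tokens)
print(decoded)
```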
All results were tested on the AISHELL-1 test set.
RTF (real-time factor) is measured on single-core CPU decoding.
AM:
| Model name | Mel layer (use/train) | Link | Extraction code | Training data | Phoneme CER (%) | Params size | RTF |
|---|---|---|---|---|---|---|---|
| ConformerCTC(S) | True/False | pan.baidu.com/s/1k6miY1yNgLrT0cB-xsqqag | 8s53 | aishell-1(50 epochs) | 6.4 | 10M | 0.056 |
| StreamingConformerCTC | True/False | pan.baidu.com/s/1Rc0x7LOiExaAC0GNhURkHw | zwh9 | aishell-1(50 epochs) | 7.2 | 15M | 0.08 |
| ChunkConformer | True/False | pan.baidu.com/s/1o_x677WUyWNld-8sNbydxg | ujmg | aishell-1(50 epochs) | 11.4 | 15M | 0.1 |
VAD:
| Model name | Link | Extraction code | Training data | Params size | RTF |
|---|---|---|---|---|---|
| 8k_online_vad | pan.baidu.com/s/1ag9VwTxIqW4C2AgF-6nIgg | ofc9 | openslr open source data | 80K | 0.0001 |
Punc:
| Model name | Link | Extraction code | Training data | Accuracy | Params size | RTF |
|---|---|---|---|---|---|---|
| PuncModel | pan.baidu.com/s/1gtvRKYIE2cAbfiqBn9bhaw | 515t | NLP open source data | 95% | 600K | 0.0001 |
Usage:
Convert the model to an ONNX file using test_asr.py and put it into pythonInference.
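A hedged sketch of running the exported ONNX model with onnxruntime; the model filename, the input/output names, and the feature shape below are assumptions and should be taken from your actual export in test_asr.py:

```python
import numpy as np
import onnxruntime as ort

# Assumed filename; inspect the real inputs/outputs with sess.get_inputs()/get_outputs().
sess = ort.InferenceSession("conformer_ctc.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

features = np.zeros((1, 100, 80), dtype=np.float32)  # placeholder (batch, frames, mels)
logits = sess.run([output_name], {input_name: features})[0]
tokens = logits.argmax(-1)  # greedy indices; still needs CTC collapsing and blank removal
```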
You are welcome to join the group to discuss and share issues. If the group has more than 200 members, please add the note "TensorflowASR" when joining.

Latest updates
Install the dependencies:
- pip install tensorflow-gpu (you can refer to https://www.bilibili.com/read/cv14876435)
- If you need to use the default phonemes, install the corresponding dependency as well.
- For the LAS structure: pip install tensorflow-addons
- pip install rir-generator
- pip install onnxruntime or pip install onnxruntime-gpu

Prepare train_list and test_list.
asr_train_list format: each line is the wav path + '\t' (tab) + text. It is recommended to write the file programmatically, e.g.:
wav_path = "xxx/xx/xx/xxx.wav"
wav_label = "这是个例子"
with open('train.list', 'w', encoding='utf-8') as f:
    f.write(wav_path + '\t' + wav_label + '\n')
For example, the resulting train.list:
/opt/data/test.wav 这个是一个例子
......
The following are the training data preparation formats for VAD and punctuation recovery (optional):
vad_train_list format:
wav_path1
wav_path2
……
For example:
/opt/data/test.wav
VAD training internally relies on energy to generate the training labels, so make sure the training corpus you prepare is recorded in quiet conditions.
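A small sketch for collecting wav paths into a vad_train_list (the directory below is just an example):

```python
import glob

# Gather all wav files under an example directory, one path per line.
wav_paths = sorted(glob.glob("/opt/data/**/*.wav", recursive=True))
with open("vad_train.list", "w", encoding="utf-8") as f:
    f.write("\n".join(wav_paths) + "\n")
```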
punc_train_list format:
text1
text2
……
The format is the same as for the LM: the text on each line contains punctuation. Currently only one punctuation mark per word is supported; consecutive punctuation marks are treated as invalid.
For example:
这是：一个例子哦。 √ (correct format)
这是：“一个例子哦”。 × (wrong format)
这是：一个例子哦“。 × (wrong format)
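Since consecutive punctuation marks are treated as invalid, a quick check like the sketch below can filter such lines before training; the punctuation set here is an assumption and should be adjusted to your data:

```python
PUNCS = set("，。？！：；、“”")  # assumed punctuation set; extend as needed

def is_valid_punc_line(line):
    """Reject lines that contain two punctuation marks in a row."""
    prev_is_punc = False
    for ch in line.strip():
        if ch in PUNCS:
            if prev_is_punc:
                return False
            prev_is_punc = True
        else:
            prev_is_punc = False
    return True

print(is_valid_punc_line("这是：一个例子哦。"))    # True  (correct format)
print(is_valid_punc_line("这是：“一个例子哦”。"))  # False (consecutive punctuation)
```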
Download BERT's pre-trained model to assist in training the punctuation recovery model. If you do not need punctuation recovery, you can skip this step:
https://pan.baidu.com/s/1_HDAhfGZfNhXS-cYoLQucA extraction code: 4hsa
Modify the configuration file am_data.yml (in ./asr/configs) to set the training options, and modify the name parameter in the model yaml (e.g. ./asr/configs/conformer.yml) to select the model structure.
Then execute the command:
python train_asr.py --data_config ./asr/configs/am_data.yml --model_config ./asr/configs/ConformerS.yml
When you want to test, you can refer to the demo in ./test_asr.py. Of course, you can modify the stt method to fit your needs:
python ./test_asr.py
You can also use the Tester to evaluate data in bulk and verify your model's performance. Execute:
python eval_am.py --data_config ./asr/configs/am_data.yml --model_config ./asr/configs/ConformerS.yml
This script will report several metrics: SER, CER, DEL, INS and SUB.
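For reference, CER and the DEL/INS/SUB counts come from the edit-distance alignment between the reference and the hypothesis; a minimal sketch of that computation (not the repo's exact implementation):

```python
def align_errors(ref, hyp):
    """Return (substitutions, deletions, insertions) from a Levenshtein alignment."""
    # dp[i][j] = (total_errors, subs, dels, inss) for ref[:i] vs hyp[:j]
    dp = [[(j, 0, 0, j) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                row.append(dp[i - 1][j - 1])  # match: carry counts over
            else:
                sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], row[j - 1]
                best = min(sub, dele, ins, key=lambda x: x[0])
                if best is sub:
                    row.append((best[0] + 1, best[1] + 1, best[2], best[3]))
                elif best is dele:
                    row.append((best[0] + 1, best[1], best[2] + 1, best[3]))
                else:
                    row.append((best[0] + 1, best[1], best[2], best[3] + 1))
        dp.append(row)
    return dp[-1][-1][1:]

ref, hyp = "这个是一个例子", "这是一个例子哦"
subs, dels, inss = align_errors(ref, hyp)
print(subs, dels, inss, f"CER={(subs + dels + inss) / len(ref):.2%}")
```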
To train the VAD or punctuation recovery model, please refer to the steps above.
If you want to use your own phonemes, you need to add the corresponding conversion method in am_dataloader.py:
def init_text_to_vocab(self):  # keep this method name
    def text_to_vocab_func(txt):
        return your_convert_function(txt)
    self.text_to_vocab = text_to_vocab_func  # self.text_to_vocab is a function, not a call
Don't forget that the vocabulary must start with <S> and </S>, e.g.:
<S>
</S>
de
shì
……
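As a concrete but hypothetical example of your_convert_function, one option is to map Chinese text to pinyin syllables with the pypinyin package; this is only one possible choice and not necessarily what the repo's default phonemes use:

```python
from pypinyin import lazy_pinyin, Style  # pip install pypinyin

def pinyin_convert_function(txt):
    """Example conversion: Chinese characters -> tone-marked pinyin syllables."""
    return lazy_pinyin(txt, style=Style.TONE)

print(pinyin_convert_function("这是个例子"))  # e.g. ['zhè', 'shì', 'gè', 'lì', 'zǐ']
```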
Refer to the following excellent projects:
https://github.com/usimarit/TiramisuASR
https://github.com/noahchalifour/warp-transducer
https://github.com/PaddlePaddle/DeepSpeech
https://github.com/baidu-research/warp-ctc
Overall, almost all models here are licensed under Apache 2.0 for all countries in the world, allowing unrestricted commercial and non-commercial use alike.
We allow and thank you for using this project for academic research, commercial products, and other purposes.
However, it is prohibited to trade this project itself as a commodity.