TensorflowASR
An end-to-end speech recognition model based on Conformer, implemented with TensorFlow 2. The real-time factor (RTF) on a CPU is around 0.1.
The current branch is the V2 version, which uses a CTC + translation structure.
You are welcome to use it and report bugs.
For the old version, please see the V1 branch.
Aishell-1 training results:
Offline results
| Name | Params | Chinese CER | Epochs | Online/Offline | Test data | Decoding method |
|---|---|---|---|---|---|---|
| Wenet(Conformer) | 9.5M | 6.48% | 100 | Offline | aishell1-test | ctc_greedy |
| Wenet(Transformer) | 9.7M | 8.68% | 100 | Offline | aishell1-test | ctc_greedy |
| Wenet(Paraformer) | 9.0M | 6.99% | 100 | Offline | aishell1-test | paraformer_greedy |
| FunASR(Paraformer) | 9.5M | 6.37% | 100 | Offline | aishell1-test | paraformer_greedy |
| FunASR(Conformer) | 9.5M | 6.64% | 100 | Offline | aishell1-test | ctc_greedy |
| FunASR(e_branchformer) | 10.1M | 6.65% | 100 | Offline | aishell1-test | ctc_greedy |
| repo(ConformerCTC) | 10.1M | 6.8% | 100 | Offline | aishell1-test | ctc_greedy |
Streaming results
| Name | Params | Chinese CER | Epochs | Online/Offline | Test data | Decoding method |
|---|---|---|---|---|---|---|
| Wenet(U2++ Conformer) | 10.6M | 8.18% | 100 | Online | aishell1-test | ctc_greedy |
| Wenet(U2++ Transformer) | 10.3M | 9.88% | 100 | Online | aishell1-test | ctc_greedy |
| repo(StreamingConformerCTC) | 10.1M | 7.2% | 100 | Online | aishell1-test | ctc_greedy |
| repo(ChunkConformer) | 10.7M | 8.9% | 100 | Online | aishell1-test | ctc_greedy |
TTS: https://github.com/Z-yq/TensorflowTTS
NLU: -
BOT: -
Even without your own data, you can still reach a reasonable level of ASR performance by using TTS-generated speech.
TTS for ASR: the TTS model was trained on aishell1 and aishell3, so the generated data is well suited for ASR training.
Tips:
- There are 500 voices in total.
- Only Chinese is supported.
- If the text to be synthesized contains punctuation marks, please remove them manually.
- If you want to add a pause, insert sil in the middle of the text.
Step 1: Prepare a list of the texts to be synthesized, e.g. a file named text.list:
这是第一句话
这是第二句话
这是一句sil有停顿的话
...
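A minimal sketch of building such a list programmatically (the sentences below are just the placeholders from the example above):

```python
# Write the sentences to be synthesized into text.list, one per line.
sentences = [
    "这是第一句话",
    "这是第二句话",
    "这是一句sil有停顿的话",  # "sil" inserts a pause
]
with open("text.list", "w", encoding="utf-8") as f:
    for s in sentences:
        f.write(s + "\n")
```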
Step 2: Download the models.
Link: https://pan.baidu.com/s/1deN1PmJ4olkRKw8ceQrUNA Extraction code: c0tp
Both models need to be downloaded and placed into the directory ./augmentations/tts_for_asr/models
Step 3: Run the script from the root directory:
python ./augmentations/tts_for_asr/tts_augment.py -f text.list -o save_dir --voice_num 10 --vc_num 3
where:
- -f: the text list prepared in step 1
- -o: the path where the synthesized corpus is saved (an absolute path is recommended)
- --voice_num: how many voices are used to synthesize each sentence
- --vc_num: how many voice-conversion augmentations are generated per sentence
After the run completes, a wavs directory and an utterance.txt will be generated under the -o path.
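If you want to feed the synthesized corpus into ASR training, a sketch like the one below can turn it into a train list. The layout of utterance.txt is assumed here to be "wav name + tab + text" per line; verify it against your generated file before using this.

```python
import os

save_dir = "save_dir"  # the path passed to -o above
pairs = []
# Assumption: each line of utterance.txt is "<wav name>\t<text>"; adjust if your file differs.
with open(os.path.join(save_dir, "utterance.txt"), encoding="utf-8") as f:
    for line in f:
        name, text = line.strip().split("\t", 1)
        pairs.append(os.path.join(save_dir, "wavs", name) + "\t" + text)

with open("tts_train.list", "w", encoding="utf-8") as f:
    f.write("\n".join(pairs) + "\n")
```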
The speech spectrogram feature-extraction layer is implemented in TF2, following the librosa library.
Alternatively, you can use Leaf, which has fewer parameters.
Usage:
mel_layer_type: Melspectrogram # alternatives: Spectrogram / leaf
trainable_kernel: True # supports training the kernel, not recommended
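Conceptually, the Melspectrogram layer computes a log-mel spectrogram in the same spirit as librosa. A minimal TF2 sketch of that computation (the parameter values here are illustrative, not necessarily the repo's defaults):

```python
import tensorflow as tf

def log_mel_spectrogram(signal, sample_rate=16000, n_fft=400, hop=160, n_mels=80):
    """Log-mel features via tf.signal, similar in spirit to librosa.feature.melspectrogram."""
    stft = tf.signal.stft(signal, frame_length=n_fft, frame_step=hop, fft_length=n_fft)
    power = tf.abs(stft) ** 2
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=n_mels,
        num_spectrogram_bins=n_fft // 2 + 1,
        sample_rate=sample_rate,
        lower_edge_hertz=0.0,
        upper_edge_hertz=sample_rate / 2,
    )
    mel = tf.matmul(power, mel_matrix)
    return tf.math.log(mel + 1e-6)

# Example: one second of audio at 16 kHz -> (frames, n_mels)
features = log_mel_spectrogram(tf.zeros([16000]))
```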
The C++ project based on ONNX has been updated.
See CppInference ONNX for details.
A Python inference scheme based on ONNX is also available; see pythonInference for details.
Streaming Conformer structures are now supported.
There are currently two implementations (a toy streaming-decode sketch follows the list):
- Block Conformer + Global CTC
- Chunk Conformer + CTC Picker
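Whichever variant is used, the streaming idea is to push fixed-size chunks of features through the encoder and greedily decode the CTC output as it arrives. A toy, self-contained sketch of that loop (the encoder below is a random stand-in, not the repo's model, and the blank index is an assumption):

```python
import numpy as np

BLANK = 0  # assumed CTC blank index

def fake_chunk_encoder(chunk):
    """Stand-in for the streaming encoder: returns per-frame token logits."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(chunk.shape[0] // 4, 10))  # pretend 4x downsampling, 10 tokens

def ctc_greedy_merge(prev_token, logits):
    """Greedy CTC over one chunk: argmax per frame, collapse repeats, drop blanks."""
    tokens = []
    for frame in logits:
        tok = int(np.argmax(frame))
        if tok != BLANK and tok != prev_token:
            tokens.append(tok)
        prev_token = tok
    return tokens, prev_token

audio_features = np.zeros((640, 80))  # placeholder feature stream
decoded, last = [], BLANK
for start in range(0, len(audio_features), 64):  # 64-frame chunks
    logits = fake_chunk_encoder(audio_features[start:start + 64])
    new_tokens, last = ctc_greedy_merge(last, logits)
    decoded.extend(new_tokens)
print(decoded)
```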
All results were tested on the AISHELL-1 test set.
RTF (real-time factor) is measured on single-core CPU decoding.
AM:
| Model name | Mel layer (use/train) | Link | Extraction code | Training data | Phoneme CER (%) | Params size | RTF |
|---|---|---|---|---|---|---|---|
| ConformerCTC(S) | True/False | pan.baidu.com/s/1k6miY1yNgLrT0cB-xsqqag | 8s53 | aishell-1(50 epochs) | 6.4 | 10M | 0.056 |
| StreamingConformerCTC | True/False | pan.baidu.com/s/1Rc0x7LOiExaAC0GNhURkHw | zwh9 | aishell-1(50 epochs) | 7.2 | 15M | 0.08 |
| ChunkConformer | True/False | pan.baidu.com/s/1o_x677WUyWNld-8sNbydxg | ujmg | aishell-1(50 epochs) | 11.4 | 15M | 0.1 |
VAD:
| Model name | Link | Extraction code | Training data | Params size | RTF |
|---|---|---|---|---|---|
| 8k_online_vad | pan.baidu.com/s/1ag9VwTxIqW4C2AgF-6nIgg | ofc9 | openslr open source data | 80K | 0.0001 |
Punc:
| Model name | Link | Extraction code | Training data | Accuracy | Params size | RTF |
|---|---|---|---|---|---|---|
| PuncModel | pan.baidu.com/s/1gtvRKYIE2cAbfiqBn9bhaw | 515t | NLP open source data | 95% | 600K | 0.0001 |
Usage:
Convert the model to an ONNX file using test_asr.py and put it into pythonInference.
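A hedged sketch of running the exported ONNX model with onnxruntime; the model filename, the input/output names, and the feature shape below are assumptions and should be taken from your actual export in test_asr.py:

```python
import numpy as np
import onnxruntime as ort

# Assumed filename; inspect the real inputs/outputs with sess.get_inputs()/get_outputs().
sess = ort.InferenceSession("conformer_ctc.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
output_name = sess.get_outputs()[0].name

features = np.zeros((1, 100, 80), dtype=np.float32)  # placeholder (batch, frames, mels)
logits = sess.run([output_name], {input_name: features})[0]
tokens = logits.argmax(-1)  # greedy indices; still needs CTC collapsing and blank removal
```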
You are welcome to join the group to discuss and share issues. If the group has more than 200 members, please add the note "TensorflowASR" when joining.

Latest updates
Install the dependencies:
- pip install tensorflow-gpu (you can refer to https://www.bilibili.com/read/cv14876435)
- If you need to use the default phonemes, install the corresponding dependency as well.
- For the LAS structure: pip install tensorflow-addons
- pip install rir-generator
- pip install onnxruntime or pip install onnxruntime-gpu

Prepare train_list and test_list.
asr_train_list format: each line is the wav path + '\t' (tab) + text. It is recommended to write the file programmatically, e.g.:
wav_path = "xxx/xx/xx/xxx.wav"
wav_label = "这是个例子"
with open('train.list', 'w', encoding='utf-8') as f:
    f.write(wav_path + '\t' + wav_label + '\n')
For example, the resulting train.list:
/opt/data/test.wav 这个是一个例子
......
The following are the training data preparation formats for VAD and punctuation recovery (optional):
vad_train_list format:
wav_path1
wav_path2
……
For example:
/opt/data/test.wav
VAD training internally relies on energy to generate the training labels, so make sure the training corpus you prepare is recorded in quiet conditions.
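A small sketch for collecting wav paths into a vad_train_list (the directory below is just an example):

```python
import glob

# Gather all wav files under an example directory, one path per line.
wav_paths = sorted(glob.glob("/opt/data/**/*.wav", recursive=True))
with open("vad_train.list", "w", encoding="utf-8") as f:
    f.write("\n".join(wav_paths) + "\n")
```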
punc_train_list format:
text1
text2
……
The format is the same as for the LM: the text on each line contains punctuation. Currently only one punctuation mark per word is supported; consecutive punctuation marks are treated as invalid.
For example:
这是：一个例子哦。 √ (correct format)
这是：“一个例子哦”。 × (wrong format)
这是：一个例子哦“。 × (wrong format)
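Since consecutive punctuation marks are treated as invalid, a quick check like the sketch below can filter such lines before training; the punctuation set here is an assumption and should be adjusted to your data:

```python
PUNCS = set("，。？！：；、“”")  # assumed punctuation set; extend as needed

def is_valid_punc_line(line):
    """Reject lines that contain two punctuation marks in a row."""
    prev_is_punc = False
    for ch in line.strip():
        if ch in PUNCS:
            if prev_is_punc:
                return False
            prev_is_punc = True
        else:
            prev_is_punc = False
    return True

print(is_valid_punc_line("这是：一个例子哦。"))    # True  (correct format)
print(is_valid_punc_line("这是：“一个例子哦”。"))  # False (consecutive punctuation)
```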
Download BERT's pre-trained model to assist in training the punctuation recovery model. If you do not need punctuation recovery, you can skip this step:
https://pan.baidu.com/s/1_HDAhfGZfNhXS-cYoLQucA extraction code: 4hsa
Modify the configuration file am_data.yml (in ./asr/configs) to set the training options, and modify the name parameter in the model yaml (e.g. ./asr/configs/conformer.yml) to select the model structure.
Then execute the command:
python train_asr.py --data_config ./asr/configs/am_data.yml --model_config ./asr/configs/ConformerS.yml
When you want to test, you can refer to the demo in ./test_asr.py. Of course, you can modify the stt method to fit your needs:
python ./test_asr.py
You can also use the Tester to evaluate data in bulk and verify your model's performance. Execute:
python eval_am.py --data_config ./asr/configs/am_data.yml --model_config ./asr/configs/ConformerS.yml
This script will report several metrics: SER, CER, DEL, INS and SUB.
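For reference, CER and the DEL/INS/SUB counts come from the edit-distance alignment between the reference and the hypothesis; a minimal sketch of that computation (not the repo's exact implementation):

```python
def align_errors(ref, hyp):
    """Return (substitutions, deletions, insertions) from a Levenshtein alignment."""
    # dp[i][j] = (total_errors, subs, dels, inss) for ref[:i] vs hyp[:j]
    dp = [[(j, 0, 0, j) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                row.append(dp[i - 1][j - 1])  # match: carry counts over
            else:
                sub, dele, ins = dp[i - 1][j - 1], dp[i - 1][j], row[j - 1]
                best = min(sub, dele, ins, key=lambda x: x[0])
                if best is sub:
                    row.append((best[0] + 1, best[1] + 1, best[2], best[3]))
                elif best is dele:
                    row.append((best[0] + 1, best[1], best[2] + 1, best[3]))
                else:
                    row.append((best[0] + 1, best[1], best[2], best[3] + 1))
        dp.append(row)
    return dp[-1][-1][1:]

ref, hyp = "这个是一个例子", "这是一个例子哦"
subs, dels, inss = align_errors(ref, hyp)
print(subs, dels, inss, f"CER={(subs + dels + inss) / len(ref):.2%}")
```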
To train the VAD or punctuation recovery model, please refer to the steps above.
If you want to use your own phonemes, you need to add the corresponding conversion method in am_dataloader.py:
def init_text_to_vocab(self):  # keep this method name
    def text_to_vocab_func(txt):
        return your_convert_function(txt)
    self.text_to_vocab = text_to_vocab_func  # self.text_to_vocab is a function, not a call
Don't forget that the vocabulary must start with <S> and </S>, e.g.:
<S>
</S>
de
shì
……
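As a concrete but hypothetical example of your_convert_function, one option is to map Chinese text to pinyin syllables with the pypinyin package; this is only one possible choice and not necessarily what the repo's default phonemes use:

```python
from pypinyin import lazy_pinyin, Style  # pip install pypinyin

def pinyin_convert_function(txt):
    """Example conversion: Chinese characters -> tone-marked pinyin syllables."""
    return lazy_pinyin(txt, style=Style.TONE)

print(pinyin_convert_function("这是个例子"))  # e.g. ['zhè', 'shì', 'gè', 'lì', 'zǐ']
```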
Refer to the following excellent projects:
https://github.com/usimarit/TiramisuASR
https://github.com/noahchalifour/warp-transducer
https://github.com/PaddlePaddle/DeepSpeech
https://github.com/baidu-research/warp-ctc
Overall, almost all models here are licensed under Apache 2.0 for all countries in the world, allowing unrestricted commercial and non-commercial use alike.
We allow and thank you for using this project for academic research, commercial products, and other purposes.
However, it is prohibited to trade this project itself as a commodity.