MASR is a Chinese Mandarin speech recognition project based on end-to-end deep neural networks.
MASR uses a gated convolutional neural network, with an architecture similar to Wav2letter, proposed by Facebook in 2016. However, the activation function is not ReLU or HardTanh but GLU (gated linear unit), hence the name gated convolutional network. In my experiments, GLU converges faster than HardTanh. If you want to study how convolutional networks perform on speech recognition, this project can serve as a reference.
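For intuition, here is a minimal PyTorch sketch of a gated convolutional layer built around GLU. The channel counts and kernel size are illustrative assumptions, not MASR's actual configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of a gated convolutional layer (illustrative only;
# not MASR's exact layer definition).
class GatedConv1d(nn.Module):
    def __init__(self, in_channels, out_channels, kernel_size):
        super().__init__()
        # Emit 2 * out_channels so GLU can split them into value and gate halves.
        self.conv = nn.Conv1d(in_channels, 2 * out_channels, kernel_size)

    def forward(self, x):
        # GLU(a, b) = a * sigmoid(b): the second half of the channels
        # gates the first half.
        return F.glu(self.conv(x), dim=1)

# Example: a batch of 8 spectrogram-like inputs (161 feature bins, 200 frames).
out = GatedConv1d(161, 500, kernel_size=11)(torch.randn(8, 161, 200))
print(out.shape)  # torch.Size([8, 500, 190])
```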
The model's performance is measured below with the character error rate (CER). CER = edit distance / sentence length; the lower, the better.
Roughly speaking, 1 - CER can be understood as the recognition accuracy.
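As a concrete reference, here is a minimal sketch of computing CER from the definition above (the example strings are made up):

```python
# CER = character-level edit distance / reference length.
def edit_distance(ref: str, hyp: str) -> int:
    # One-row dynamic-programming Levenshtein distance.
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, 1):
            prev, dp[j] = dp[j], min(
                dp[j] + 1,        # deletion
                dp[j - 1] + 1,    # insertion
                prev + (r != h),  # substitution (cost 0 if characters match)
            )
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    return edit_distance(ref, hyp) / len(ref)

print(cer("今天天气不错", "今天天汽不错"))  # 1 substitution / 6 chars ≈ 0.17
```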
The model was trained on the AISHELL-1 dataset: about 150 hours of recordings covering more than 4,000 Chinese characters. Speech recognition systems used in industry are typically trained on at least 10 times as much audio as this project, together with language models trained on corpora for specific scenarios, so do not expect this project to match industrial recognition quality. That is unrealistic for any individual project on GitHub, unless more advanced technology comes along.
What is a language model trained on corpora for specific scenarios? For example, when you use speech recognition inside a game, it tends to recognize your words as things you might say while playing, such as "Diao Chan was beaten to death by Lan". In other contexts, "Diao Chan was beaten to death by Lan" is not a natural sentence at all. If you don't believe it, say "Diao Chan was beaten to death by Lan" to someone who has read Romance of the Three Kingdoms but never played Honor of Kings. He is bound to ask you back: "What? Who was Diao Chan killed by? Who is Lan?"
On a single GTX 1080 Ti, one training epoch takes about 20 minutes. (The lab machine runs a fairly old CUDA version; it is quite possible this would be faster after a CUDA upgrade.)
The figure above shows the validation-set CER as a function of training epoch. As you can see, the validation CER drops to 11%.
Test-set performance is not shown in the figure; the test-set CER is slightly higher, at 14%.
With an external language model, the test-set CER can be reduced to 8%.
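As an illustration of how an external language model can be plugged in, here is a minimal rescoring sketch in the style of shallow fusion. This is a generic approach, not necessarily this project's decoder, and `ALPHA`, `BETA`, and `lm_log_prob` are hypothetical names:

```python
# Rescore decoder candidates with an external language model (shallow fusion).
ALPHA = 0.8  # weight on the language-model score (hypothetical value)
BETA = 0.3   # per-character bonus, since the LM tends to favor short outputs

def rescore(candidates, lm_log_prob):
    """Pick the best transcript from (text, acoustic_log_prob) candidates.

    lm_log_prob: a function mapping text to its log-probability under the
    external language model (assumed to be provided elsewhere).
    """
    best_text, _ = max(
        candidates,
        key=lambda c: c[1] + ALPHA * lm_log_prob(c[0]) + BETA * len(c[0]),
    )
    return best_text
```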
The pre-trained model currently provided with the project has been trained for about 100 epochs, which is close to the best it can achieve.