Deepang Raval 1 | Vyom Pathak 1 | Muktan Patel 1 | Brijesh Bhatt 1
Dharmsinh Desai University
We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and a Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we propose different analysis methods. These insights help us understand our ASR system with respect to a particular language (Gujarati), and can also guide ASR systems toward better performance for low-resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11% decrease in Word Error Rate (WER) with respect to the base-model WER.
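CTC, used above as the loss function, trains the network to emit a per-frame label distribution that includes a blank symbol; a transcript is recovered by collapsing repeated labels and dropping blanks. The following greedy decoder is an illustrative sketch of that collapse step only, not the repository's code (the paper uses prefix decoding with language models instead of greedy decoding):

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse repeated labels, then remove blanks (greedy CTC decoding)."""
    out = []
    prev = None
    for label in frame_ids:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frames '1 1 blank 1 2 2' collapse to '1 1 2': the blank separates
# the two 1s, so they are kept as distinct output labels.
print(ctc_greedy_decode([1, 1, 0, 1, 2, 2]))  # -> [1, 1, 2]
```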
If you find this work useful, please cite it using the following BibTeX:
```bibtex
@inproceedings{raval-etal-2020-end,
    title = "End-to-End Automatic Speech Recognition for {G}ujarati",
    author = "Raval, Deepang and
      Pathak, Vyom and
      Patel, Muktan and
      Bhatt, Brijesh",
    booktitle = "Proceedings of the 17th International Conference on Natural Language Processing (ICON)",
    month = dec,
    year = "2020",
    address = "Indian Institute of Technology Patna, Patna, India",
    publisher = "NLP Association of India (NLPAI)",
    url = "https://aclanthology.org/2020.icon-main.56",
    pages = "409--419",
    abstract = "We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we proposed different analysis methods. These insights help to understand our ASR system based on a particular language (Gujarati) as well as can govern ASR systems{'} to improve the performance for low resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11{%} decrease in Word Error Rate (WER) with respect to base-model WER.",
}
```

To set up the environment, clone the repository:

```sh
git clone https://github.com/01-vyom/End_2_End_Automatic_Speech_Recognition_For_Gujarati.git
```
Create and activate a virtual environment:

```sh
python -m venv asr_env
source $PWD/asr_env/bin/activate
```

Change the directory to the root of the repository, then install the dependencies:

```sh
pip install --upgrade pip
pip install -r requirements.txt
```
To train the model in the paper, run this command:
```sh
python ./Train/train.py
```

Note:
- Change `PathDataAudios` and `PathDataTranscripts` to point to the appropriate paths of the audio files and the transcript files.
- Change `currmodel` to change the name under which the model is saved.

To run inference using the trained model, run:
```sh
python ./Eval/inference.py
```

Note:
- Change `PathDataAudios` and `PathDataTranscripts` to point to the appropriate paths of the audio files and the transcript files used for testing.
- To change the model file name used for testing, change the `model` variable; to change the test set, change the `test_data` variable.
- The references and hypotheses are saved as `.pickle` files in `./Eval/`.

To decode the inferred output, run:
```sh
python ./Eval/decode.py
```

Note:
- To select the `.pickle` file, change the `model` variable.
- The output is saved in `./Eval/` as a file containing all types of decoding for that model along with the actual text.

To post-process the decoded output, follow the steps mentioned in this README.
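The combined language model scoring used during prefix decoding can be illustrated by interpolating a word-level (WLM) and a character-level (CLM) score for each candidate transcript. The sketch below uses hypothetical toy unigram models and an illustrative interpolation weight; the repository itself performs full prefix beam search rather than this simple rescoring:

```python
import math

# Toy unigram LMs (hypothetical probabilities, for illustration only).
word_lm = {"hello": 0.6, "world": 0.4}
char_lm = {c: 0.1 for c in "helowrd "}

def wlm_score(text):
    # Word-level LM: sum of word log-probs, with a tiny floor for OOV words.
    return sum(math.log(word_lm.get(w, 1e-6)) for w in text.split())

def clm_score(text):
    # Character-level LM: sum of character log-probs.
    return sum(math.log(char_lm.get(c, 1e-6)) for c in text)

def combined_score(text, alpha=0.7):
    # Interpolate the two LM scores; alpha is an illustrative weight.
    return alpha * wlm_score(text) + (1 - alpha) * clm_score(text)

# The in-vocabulary candidate scores higher than the misspelled one,
# because the OOV word "wrold" is heavily penalized by the word LM.
candidates = ["hello world", "hello wrold"]
best = max(candidates, key=combined_score)
print(best)  # -> hello world
```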
To perform the system analysis, run:
```sh
python "./System Analysis/system_analysis.py"
```

Note:
- To select the model-specific decoding `.csv` file for analysis, change the `model` variable.
- To select a specific type of column (hypothesis type) for analysis, change the `type` variable.

The output files will be saved in `./System Analysis/`, specific to the model and decoding type.
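The Word Error Rate reported by the analysis is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch (the function name and layout are illustrative, not the repository's implementation):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("a b c d", "a x c"))  # 1 substitution + 1 deletion -> 0.5
```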
Our model achieves the following performance:
| Technique | Decrease in WER (%) |
|---|---|
| Prefix decoding with LMs | 2.42 |
| Prefix decoding with LMs + BERT-based post-processing | 5.11 |
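The decrease column above is a relative reduction with respect to the base model's WER, i.e. (WER_base − WER_new) / WER_base × 100. A tiny sketch; the WER values in the example are made up for illustration:

```python
def relative_wer_decrease(wer_base, wer_new):
    # Relative reduction in WER, expressed as a percentage of the base WER.
    return (wer_base - wer_new) / wer_base * 100

# Hypothetical example: a base WER of 40.0% improved to 37.956%
# corresponds to the 5.11% relative decrease reported above.
print(round(relative_wer_decrease(40.0, 37.956), 2))  # -> 5.11
```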
Note:

The prefix decoding code is based on open-source implementations 1 and 2. The code for the BERT-based spell corrector is adapted from this open-source implementation.

Licensed under the MIT License.