Deepang Raval 1 | Vyom Pathak 1 | Muktan Patel 1 | Brijesh Bhatt 1
Dharmsinh Desai University
We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and a Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we propose different analysis methods. These insights help to understand our ASR system with respect to a particular language (Gujarati), and can also guide ASR systems to improve performance for low-resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11% decrease in Word Error Rate (WER) with respect to the base-model WER.
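The CTC loss mentioned above marginalizes over all frame-level alignments that collapse to the target transcript. As a rough illustration of that idea only (not the repository's implementation, which trains the full CNN + BiLSTM + Dense network), here is a minimal pure-Python sketch of the standard CTC forward (alpha) recursion:

```python
import math

def ctc_loss(probs, labels, blank=0):
    """Negative log-likelihood of `labels` given per-frame distributions
    `probs` (probs[t][k] = P(symbol k at frame t)), via the CTC forward
    recursion over the blank-extended label sequence."""
    # Extend the label sequence with blanks: b, l1, b, l2, b, ...
    ext = [blank]
    for l in labels:
        ext += [l, blank]
    T, S = len(probs), len(ext)

    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]       # start with a blank ...
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]  # ... or with the first label

    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                     # stay on the same symbol
            if s > 0:
                a += alpha[t - 1][s - 1]            # advance by one symbol
            # Skip the intermediate blank, but only between distinct labels.
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]

    # Valid alignments end on the final label or the trailing blank.
    total = alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)
    return -math.log(total)
```

For example, with two frames, a two-symbol vocabulary {blank, 'a'}, and uniform per-frame probabilities, the label "a" is produced by the paths (a,-), (-,a), and (a,a), so its total probability is 0.75.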
If you find this work useful, please cite it using the following BibTeX:
@inproceedings{raval-etal-2020-end,
    title = "End-to-End Automatic Speech Recognition for {G}ujarati",
    author = "Raval, Deepang  and
      Pathak, Vyom  and
      Patel, Muktan  and
      Bhatt, Brijesh",
    booktitle = "Proceedings of the 17th International Conference on Natural Language Processing (ICON)",
    month = dec,
    year = "2020",
    address = "Indian Institute of Technology Patna, Patna, India",
    publisher = "NLP Association of India (NLPAI)",
    url = "https://aclanthology.org/2020.icon-main.56",
    pages = "409--419",
    abstract = "We present a novel approach for improving the performance of an End-to-End speech recognition system for the Gujarati language. We follow a deep learning based approach which includes Convolutional Neural Network (CNN), Bi-directional Long Short Term Memory (BiLSTM) layers, Dense layers, and Connectionist Temporal Classification (CTC) as a loss function. In order to improve the performance of the system with the limited size of the dataset, we present a combined language model (WLM and CLM) based prefix decoding technique and Bidirectional Encoder Representations from Transformers (BERT) based post-processing technique. To gain key insights from our Automatic Speech Recognition (ASR) system, we proposed different analysis methods. These insights help to understand our ASR system based on a particular language (Gujarati) as well as can govern ASR systems{'} to improve the performance for low resource languages. We have trained the model on the Microsoft Speech Corpus, and we observe a 5.11{%} decrease in Word Error Rate (WER) with respect to base-model WER.",
}

To install the requirements, clone the repository and create a virtual environment:

```shell
git clone https://github.com/01-vyom/End_2_End_Automatic_Speech_Recognition_For_Gujarati.git
python -m venv asr_env
source $PWD/asr_env/bin/activate
```

Change directory to the root of the repository, then run:

```shell
pip install --upgrade pip
pip install -r requirements.txt
```
To train the model in the paper, run this command:

```shell
python ./Train/train.py
```

Notes:
- Change `PathDataAudios` and `PathDataTranscripts` to point to the appropriate paths of the audio files and the transcript files.
- Change `currmodel` to change the name under which the trained model is saved.

To run inference with the trained model, run:

```shell
python ./Eval/inference.py
```

Notes:
- Change `PathDataAudios` and `PathDataTranscripts` to point to the appropriate paths of the audio files and the transcript files used for testing.
- Change the `model` variable to change the model file used for testing, and change the `test_data` variable to select the test data.
- The `.pickle` files of ground truths and hypotheses are saved in `./Eval/`.

To decode the inferred output, run:

```shell
python ./Eval/decode.py
```

Notes:
- To select the model-specific `.pickle` file, change the `model` variable.
- The output file for the model, containing all types of decodings along with the actual text, is saved in `./Eval/`.
- To post-process the decoded output, follow the steps mentioned in this README.
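For intuition about what decoding does: the simplest way to turn per-frame CTC outputs into text is greedy (best-path) decoding, which takes the argmax symbol at each frame, collapses repeats, and drops blanks. The repository's decoding goes further, using prefix decoding scored with word- and character-level language models (WLM/CLM), but the greedy baseline below (an illustrative sketch, not the repository's code) shows the collapsing rule these decoders share:

```python
def ctc_greedy_decode(frame_probs, alphabet, blank=0):
    """Best-path CTC decoding: per-frame argmax, collapse repeated
    symbols, then remove blanks. `frame_probs[t][k]` is P(symbol k at
    frame t); `alphabet` maps non-blank indices to characters."""
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    decoded = []
    prev = None
    for idx in best_path:
        if idx != prev and idx != blank:  # collapse repeats, skip blanks
            decoded.append(alphabet[idx])
        prev = idx
    return "".join(decoded)
```

For example, the argmax path [a, a, blank, a, b] decodes to "aab": the first repeated 'a' is collapsed, while the blank separates the second 'a' so it survives as a new symbol.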
To perform system analysis, run:

```shell
python "./System Analysis/system_analysis.py"
```

Notes:
- To select the model-specific decoding `.csv` file for analysis, change the `model` variable.
- To select a particular type of column (hypothesis type) for analysis, change the `type` variable.
- The output files are saved in `./System Analysis/`, specific to the model and decoding type.
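The WER reported in these analyses is the word-level Levenshtein distance (substitutions + insertions + deletions) between hypothesis and reference, divided by the number of reference words. A minimal reference implementation (an illustrative sketch; the repository's script may compute it differently):

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)
```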
Our algorithm achieves the following performance:

| Technique Name | Reduction in WER (%) |
|---|---|
| Prefix decoding with LMs | 2.42 |
| Prefix decoding with LMs + BERT-based post-processing | 5.11 |
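The reduction column is relative to the base model's WER, i.e. (base − new) / base × 100. A one-liner (with illustrative numbers, not values from the paper):

```python
def relative_wer_reduction(base_wer, new_wer):
    """Relative WER reduction (%) with respect to the base model."""
    return (base_wer - new_wer) / base_wer * 100.0
```

For example, a base WER of 0.50 that improves to 0.45 is a 10% relative reduction.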
Notes:
- The prefix decoding code is based on the open-source implementations 1 and 2.
- The code for the BERT-based spell corrector is based on this open-source implementation.

Licensed under the MIT License.