
This repository contains the last version of the PyTorch-Kaldi toolkit (PyTorch-Kaldi-v1.0). To take a look at the previous version (PyTorch-Kaldi-v0.1), click here.
If you use this code or part of it, please cite the following paper:
M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", ArXiv
@inproceedings{pytorch-kaldi,
title = {The PyTorch-Kaldi Speech Recognition Toolkit},
author = {M. Ravanelli and T. Parcollet and Y. Bengio},
booktitle = {In Proc. of ICASSP},
year = {2019}
}
The toolkit is released under a Creative Commons Attribution 4.0 International license. You can copy, distribute, and modify the code for research, commercial, and non-commercial purposes. We only ask to cite the paper referenced above.
To improve transparency and replicability of speech recognition results, we give users the possibility to release their PyTorch-Kaldi models within this repository. Feel free to contact us (or make a pull request) for that. Moreover, if your paper uses PyTorch-Kaldi, it is also possible to advertise it in this repository.
See a short introductory video on the PyTorch-Kaldi toolkit.
We are happy to announce that the SpeechBrain project (https://speechbrain.github.io/) is now public! We strongly encourage users to migrate to SpeechBrain. It is a much better project and it already supports several speech processing tasks, such as speech recognition, speaker recognition, SLU, speech enhancement, speech separation, multi-microphone signal processing, and many others.
The goal is to develop a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised learning, and many others.
The project will be led by Mila and is sponsored by Samsung, Nvidia, and Dolby. SpeechBrain will also benefit from the collaboration and expertise of other companies such as Facebook/PyTorch, IBM Research, and Fluent.ai.
We are actively looking for collaborators. If you are interested in collaborating, feel free to contact us at [email protected].
Thanks to our sponsors, we are also able to hire interns working on SpeechBrain at Mila. The ideal candidate is a PhD student with experience in PyTorch and speech technologies (send your CV to [email protected]).
The development of SpeechBrain will take several months before we have a working repository. In the meantime, we will continue to support the PyTorch-Kaldi project.
Stay tuned!
The PyTorch-Kaldi project aims to bridge the gap between the Kaldi and the PyTorch toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits, but it embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed to naturally plug in user-defined acoustic models. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to properly work locally or on HPC clusters.
Some features of the new version of the PyTorch-Kaldi toolkit:
Remember to add the Kaldi binaries into your PATH, e.g., by adding the following lines to your .bashrc:
export KALDI_ROOT=/home/mirco/kaldi-trunk
PATH=$PATH:$KALDI_ROOT/tools/openfst
PATH=$PATH:$KALDI_ROOT/src/featbin
PATH=$PATH:$KALDI_ROOT/src/gmmbin
PATH=$PATH:$KALDI_ROOT/src/bin
PATH=$PATH:$KALDI_ROOT/src/nnetbin
export PATH
Remember to change the KALDI_ROOT variable with your own path. As a first test to check the installation, open a bash shell, type "copy-feats" or "hmm-info" and make sure no errors appear.
If not already done, install PyTorch (http://pytorch.org/). We tested our code on PyTorch 1.0 and PyTorch 0.4. Older versions of PyTorch are likely to raise errors. To check your installation, type "python" and, once in the console, type "import torch" and make sure no errors appear.
We recommend running the code on a GPU machine. Make sure that the CUDA libraries (https://developer.nvidia.com/cuda-downloads) are installed and correctly working. We tested our system on CUDA 9.0, 9.1, and 8.0. Make sure that python is installed (the code is tested with Python 2.7 and Python 3.7). Even though not mandatory, we suggest using Anaconda (https://anaconda.org/anaconda/python).
Update (Feb 19, 2019):
batch_size_train = 128*12 | 64*10 | 32*2
The line above means: do 12 epochs with batch size 128, 10 epochs with batch size 64, and 2 epochs with batch size 32. A similar formalism can be used for learning rate and dropout scheduling. See the section on batch size, learning rate, and dropout scheduling for more information.
Update (Feb 5, 2019):
Notes on the next version: in the next version, we plan to further expand the functionalities of the toolkit, supporting more models and feature formats. The goal is to make our toolkit suitable for other speech-related tasks, such as end-to-end speech recognition, speaker recognition, keyword spotting, speech separation, voice activity detection, speech enhancement, etc.
To install PyTorch-Kaldi, do the following steps:
git clone https://github.com/mravanelli/pytorch-kaldi
pip install -r requirements.txt
In the following, we provide a short tutorial of the PyTorch-Kaldi toolkit based on the popular TIMIT dataset.
Make sure you have the TIMIT dataset. If not, it can be downloaded from the LDC website (https://catalog.ldc.upenn.edu/ldc93s1).
Make sure that Kaldi and PyTorch are correctly installed. Also, make sure that your Kaldi paths are currently working (you should add the Kaldi paths into the .bashrc as reported in the section "Prerequisites"). For instance, type "copy-feats" and "hmm-info" and make sure no errors appear.
Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute the features and labels later used to train the PyTorch neural networks. We recommend running the full TIMIT s5 recipe (including the DNN training):
cd kaldi/egs/timit/s5
./run.sh
./local/nnet/run_dnn.sh
This way all the necessary files are created and the user can directly compare the results obtained by Kaldi with those of our toolkit.
Compute the alignments (i.e., the phone-state labels) for the dev and test data. If you want to use tri3 alignments, type:
steps/align_fmllr.sh --nj 4 data/dev data/lang exp/tri3 exp/tri3_ali_dev
steps/align_fmllr.sh --nj 4 data/test data/lang exp/tri3 exp/tri3_ali_test
If you want to use DNN alignments (as suggested), type:
steps/nnet/align.sh --nj 4 data-fmllr-tri3/train data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali
steps/nnet/align.sh --nj 4 data-fmllr-tri3/dev data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_dev
steps/nnet/align.sh --nj 4 data-fmllr-tri3/test data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_test
We start this tutorial with a very simple MLP network trained on MFCC features. Before launching the experiment, take a look at the configuration file cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg. See the description of the configuration files below for a detailed explanation of all its fields.
Change the config file according to your paths. In particular:
To avoid errors, make sure that all the paths specified in the cfg file exist. Please, avoid using paths containing bash variables, since paths are read literally and are not automatically expanded (e.g., use /home/mirco/kaldi-trunk/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali instead of $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali).
Run the ASR experiment:
python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg
The script starts a full ASR experiment and performs training, validation, forward, and decoding steps. A progress bar shows the evolution of all the aforementioned phases. The script run_exp.py progressively creates several files in the output directory (such as the res.res and log.log files described below).
Note that you can stop the experiment at any time. If you run the script again, it will automatically start from the last chunk correctly processed. The training could take a couple of hours, depending on the available GPU. Note also that if you would like to change some parameters of the config file (e.g., n_chunks=, fea_lst=, batch_size_train=, ...), you must specify a different output folder (output_folder=).
Debug: if you run into some errors, we suggest performing the following checks:
Take a look at the standard output.
If that is not helpful, take a look into the log.log file.
Take a look at the function run_nn in the core.py library. Add some prints in the various parts of the function to isolate the problem and figure out the issue.
At the end of training, the phone error rate (PER%) is appended into the res.res file. To see more details on the decoding results, you can go into "decoding_test" in the output folder and take a look at the various files created. For this specific example, we obtained the following res.res file:
ep=000 tr=['TIMIT_tr'] loss=3.398 err=0.721 valid=TIMIT_dev loss=2.268 err=0.591 lr_architecture1=0.080000 time(s)=86
ep=001 tr=['TIMIT_tr'] loss=2.137 err=0.570 valid=TIMIT_dev loss=1.990 err=0.541 lr_architecture1=0.080000 time(s)=87
ep=002 tr=['TIMIT_tr'] loss=1.896 err=0.524 valid=TIMIT_dev loss=1.874 err=0.516 lr_architecture1=0.080000 time(s)=87
ep=003 tr=['TIMIT_tr'] loss=1.751 err=0.494 valid=TIMIT_dev loss=1.819 err=0.504 lr_architecture1=0.080000 time(s)=88
ep=004 tr=['TIMIT_tr'] loss=1.645 err=0.472 valid=TIMIT_dev loss=1.775 err=0.494 lr_architecture1=0.080000 time(s)=89
ep=005 tr=['TIMIT_tr'] loss=1.560 err=0.453 valid=TIMIT_dev loss=1.773 err=0.493 lr_architecture1=0.080000 time(s)=88
.........
ep=020 tr=['TIMIT_tr'] loss=0.968 err=0.304 valid=TIMIT_dev loss=1.648 err=0.446 lr_architecture1=0.002500 time(s)=89
ep=021 tr=['TIMIT_tr'] loss=0.965 err=0.304 valid=TIMIT_dev loss=1.649 err=0.446 lr_architecture1=0.002500 time(s)=90
ep=022 tr=['TIMIT_tr'] loss=0.960 err=0.302 valid=TIMIT_dev loss=1.652 err=0.447 lr_architecture1=0.001250 time(s)=88
ep=023 tr=['TIMIT_tr'] loss=0.959 err=0.301 valid=TIMIT_dev loss=1.651 err=0.446 lr_architecture1=0.000625 time(s)=88
%WER 18.1 | 192 7215 | 84.0 11.9 4.2 2.1 18.1 99.5 | -0.583 | /home/mirco/pytorch-kaldi-new/exp/TIMIT_MLP_basic5/decode_TIMIT_test_out_dnn1/score_6/ctm_39phn.filt.sys
The achieved PER(%) is 18.1%. Note that there could be some variability in the results, due to different initializations on different machines. We believe that averaging the performance obtained with different initialization seeds (i.e., changing the field seed in the config file) is crucial for TIMIT, since the natural performance variability might completely hide the experimental evidence. We noticed that the standard deviation for the TIMIT experiments is about 0.2%.
If you want to change the features, you have to first compute them with the Kaldi toolkit. To compute fbank features, you have to open $KALDI_ROOT/egs/timit/s5/run.sh and compute them with the following lines:
feadir=fbank
for x in train dev test; do
steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done
Then, change the aforementioned configuration file with the new feature list. If you already ran the full TIMIT Kaldi recipe, you can directly find the fMLLR features in $KALDI_ROOT/egs/timit/s5/data-fmllr-tri3. If you feed the neural network with such features, you should expect a performance improvement, due to the adoption of speaker adaptation.
In the TIMIT_baselines folder, we propose examples of other possible TIMIT baselines. Similarly to the previous example, you can simply run them by typing:
python run_exp.py $cfg_file
There are some examples with recurrent (TIMIT_RNN*, TIMIT_LSTM*, TIMIT_GRU*, TIMIT_liGRU*) and CNN architectures (TIMIT_CNN*). We also propose a more advanced model (TIMIT_DNN_liGRU_DNN_mfcc+fbank+fmllr.cfg) where we use a combination of feed-forward and recurrent neural networks fed by a concatenation of MFCC, FBANK, and fMLLR features. Note that the latter configuration file corresponds to the best architecture described in the reference paper. As you might see from the above-mentioned configuration files, we improve the ASR performance by including some tricks such as monophone regularization (i.e., we jointly estimate both context-dependent and context-independent targets). The following table reports the results obtained by running the aforementioned systems (average PER%):
| Model | MFCC | FBANK | fMLLR |
|---|---|---|---|
| Kaldi DNN Baseline | ------- | ------- | 18.5 |
| MLP | 18.2 | 18.7 | 16.7 |
| RNN | 17.7 | 17.2 | 15.9 |
| SRU | ------- | 16.6 | ------- |
| LSTM | 15.1 | 14.3 | 14.5 |
| GRU | 16.0 | 15.2 | 14.9 |
| Li-GRU | 15.5 | 14.9 | 14.2 |
Results show that, as expected, fMLLR features outperform MFCC and FBANK coefficients, thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP, especially when using LSTM, GRU, and Li-GRU architectures, which effectively address gradient vanishing through multiplicative gates. The best result (PER=14.2%) is obtained with the Li-GRU model [2,3], which is based on a single gate and thus saves 33% of the computations over a standard GRU.
The best results are actually obtained with a more complex architecture that combines MFCC, FBANK, and fMLLR features (see cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg). To the best of our knowledge, the PER=13.8% achieved by the latter system yields the best published performance on the TIMIT test set.
The Simple Recurrent Unit (SRU) is an efficient and highly parallelizable recurrent model. Its performance on ASR is worse than that of the standard LSTM, GRU, and Li-GRU models, but it is significantly faster. The SRU implemented here is described in the following paper:
T. Lei, Y. Zhang, S. Wang, H. Dai, Y. Artzi, "Simple Recurrent Units for Highly Parallelizable Recurrence", EMNLP 2018.
To run an experiment with this model, use the configuration file cfg/TIMIT_baselines/TIMIT_SRU_fbank.cfg. Before using it, install the model with pip install sru and uncomment the "import sru" line in neural_networks.py.
You can directly compare your results with ours. In this external repository, you can find all the folders containing the generated files.
The steps to run PyTorch-Kaldi on the Librispeech dataset are similar to those reported above for TIMIT. The following tutorial is based on the 100h subset, but it can be easily extended to the full dataset (960h).
mkdir exp/tri4b/decode_tgsmall_train_clean_100 && cp exp/tri4b/trans.* exp/tri4b/decode_tgsmall_train_clean_100/
. ./cmd.sh ## You'll want to change cmd.sh to something that will work on your system.
. ./path.sh ## Source the tools/utils (import the queue.pl)
gmmdir=exp/tri4b
for chunk in train_clean_100 dev_clean test_clean; do
dir=fmllr/$chunk
steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
   --transform-dir $gmmdir/decode_tgsmall_$chunk \
$dir data/$chunk $gmmdir $dir/log $dir/data || exit 1
compute-cmvn-stats --spk2utt=ark:data/$chunk/spk2utt scp:fmllr/$chunk/feats.scp ark:$dir/data/cmvn_speaker.ark
done
# alignments on dev_clean and test_clean
steps/align_fmllr.sh --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100
steps/align_fmllr.sh --nj 10 data/dev_clean data/lang exp/tri4b exp/tri4b_ali_dev_clean_100
steps/align_fmllr.sh --nj 10 data/test_clean data/lang exp/tri4b exp/tri4b_ali_test_clean_100
python run_exp.py cfg/Librispeech_baselines/libri_MLP_fmllr.cfg
If you would like to use a recurrent model, you can use libri_RNN_fmllr.cfg, libri_LSTM_fmllr.cfg, libri_GRU_fmllr.cfg, or libri_liGRU_fmllr.cfg. The training of the recurrent models might take some days (depending on the adopted GPU). The performance obtained with the tgsmall graph is reported in the following table:
| Model | WER(%) |
|---|---|
| MLP | 9.6 |
| LSTM | 8.6 |
| GRU | 8.6 |
| Li-GRU | 8.6 |
These results are obtained without adding lattice rescoring (i.e., using only the tgsmall graph). You can improve the performance by adding lattice rescoring in this way (run it from the kaldi_decoding_scripts folder of PyTorch-Kaldi):
data_dir=/data/milatmp1/ravanelm/librispeech/s5/data/
dec_dir=/u/ravanelm/pytorch-Kaldi-new/exp/libri_fmllr/decode_test_clean_out_dnn1/
out_dir=/u/ravanelm/pytorch-kaldi-new/exp/libri_fmllr/
steps/lmrescore_const_arpa.sh $data_dir/lang_test_{tgsmall,fglarge} \
$data_dir/test_clean $dec_dir $out_dir/decode_test_clean_fglarge || exit 1;
The final results obtained with rescoring (fglarge) are reported in the following table:
| Model | WER(%) |
|---|---|
| MLP | 6.5 |
| LSTM | 6.4 |
| GRU | 6.3 |
| Li-GRU | 6.2 |
You can take a look at the results obtained by us here.
The main script to run an ASR experiment is run_exp.py. This python script performs training, validation, forward, and decoding steps. Training is performed over several epochs that progressively process all the training material with the considered neural network. After each training epoch, a validation step is performed to monitor the system performance on held-out data. At the end of training, the forward phase is carried out by computing the posterior probabilities of the specified test dataset. The posterior probabilities are normalized by their priors (using a count file) and stored into an ark file. A decoding step is then performed to retrieve the final sequence of words uttered by the speaker in the test sentences.
The run_exp.py script takes in input a global config file (e.g., cfg/TIMIT_MLP_mfcc.cfg) that specifies all the needed options to run a full experiment. The code run_exp.py calls another function, run_nn (see the core.py library), that performs training, validation, and forward operations on each chunk of data. The function run_nn takes in input a chunk-specific config file (e.g., exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep*_ck*.cfg) that specifies all the parameters needed to run an experiment on a single chunk. The run_nn function outputs some info files (e.g., exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep000_ck00.info) that summarize the losses and errors of the processed chunk.
Results are summarized into the res.res file, while errors and warnings are redirected into the log.log file.
There are two types of config files (global and chunk-specific cfg files). They are both in INI format and are read, processed, and modified with the configparser library of python. The global file contains several sections that specify all the main steps of a speech recognition experiment (training, validation, forward, and decoding). The structure of the config file is described in a prototype file (see for instance proto/global.proto) that not only lists all the required sections and fields but also specifies the type of each possible field. For example, n_ep=int(1,inf) means that the field n_ep (i.e., the number of training epochs) must be an integer ranging from 1 to inf. Similarly, lr=float(0,inf) means that the lr field (i.e., the learning rate) must be a float ranging from 0 to inf. Any attempt to write a config file that is not compliant with these specifications will raise an error.
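To make the type specifications concrete, the following minimal sketch (an illustration, not the toolkit's actual parser) checks a config value against a proto declaration such as int(1,inf) or float(0,inf):

import re

def check_field(value, spec):
    # spec is a proto declaration such as "int(1,inf)" or "float(0,inf)"
    m = re.match(r"(int|float)\((\S+),(\S+)\)", spec)
    if m is None:
        return True  # other field types (e.g., strings) are not checked in this sketch
    cast = int if m.group(1) == "int" else float
    low, high = float(m.group(2)), float(m.group(3))
    try:
        parsed = cast(value)
    except ValueError:
        return False
    return low <= parsed <= high

print(check_field("24", "int(1,inf)"))      # True: a valid number of epochs
print(check_field("0", "int(1,inf)"))       # False: outside the allowed range
print(check_field("0.08", "float(0,inf)"))  # True: a valid learning rate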
Let's now open a config file (e.g., cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg) and describe its main sections:
[cfg_proto]
cfg_proto = proto/global.proto
cfg_proto_chunk = proto/global_chunk.proto
The current version of the config file first specifies the paths of the global and chunk-specific prototype files in the [cfg_proto] section.
[exp]
cmd =
run_nn_script = run_nn
out_folder = exp/TIMIT_MLP_basic5
seed = 1234
use_cuda = True
multi_gpu = False
save_gpumem = False
n_epochs_tr = 24
The section [exp] contains some important fields, such as the output folder (out_folder) and the path of the chunk-specific processing script run_nn (by default, this function should be implemented in the core.py library). The field n_epochs_tr specifies the selected number of training epochs. Other options about use_cuda, multi_gpu, and save_gpumem can be enabled by the user. The field cmd can be used to append a command to run the script on an HPC cluster.
[dataset1]
data_name = TIMIT_tr
fea = fea_name=mfcc
fea_lst=quick_test/data/train/feats_mfcc.scp
fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/train/utt2spk ark:quick_test/mfcc/train_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
cw_left=5
cw_right=5
lab = lab_name=lab_cd
lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali
lab_opts=ali-to-pdf
lab_count_file=auto
lab_data_folder=quick_test/data/train/
lab_graph=quick_test/graph
n_chunks = 5
[dataset2]
data_name = TIMIT_dev
fea = fea_name=mfcc
fea_lst=quick_test/data/dev/feats_mfcc.scp
fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/dev/utt2spk ark:quick_test/mfcc/dev_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
cw_left=5
cw_right=5
lab = lab_name=lab_cd
lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali_dev
lab_opts=ali-to-pdf
lab_count_file=auto
lab_data_folder=quick_test/data/dev/
lab_graph=quick_test/graph
n_chunks = 1
[dataset3]
data_name = TIMIT_test
fea = fea_name=mfcc
fea_lst=quick_test/data/test/feats_mfcc.scp
fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/test/utt2spk ark:quick_test/mfcc/test_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
cw_left=5
cw_right=5
lab = lab_name=lab_cd
lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali_test
lab_opts=ali-to-pdf
lab_count_file=auto
lab_data_folder=quick_test/data/test/
lab_graph=quick_test/graph
n_chunks = 1
The config file contains a number of sections ([dataset1], [dataset2], [dataset3], ...) that describe all the corpora used for the ASR experiment. The fields of the [dataset*] sections describe all the features and labels considered in the experiment. The features, for instance, are specified in the field fea:, where fea_name contains the name given to the feature, fea_lst is the list of features (in the scp Kaldi format), fea_opts allows users to specify how to process the features (e.g., doing CMVN or adding the derivatives), while cw_left and cw_right set the characteristics of the context window (i.e., the number of left and right frames to append). Note that the current version of the PyTorch-Kaldi toolkit supports the definition of multiple feature streams. Indeed, as shown in cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg, multiple feature streams (e.g., mfcc, fbank, fmllr) can be employed.
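To make the role of cw_left and cw_right concrete, here is a small sketch (an illustration, not the toolkit's internal code, and with a simple border-replication policy that may differ from the actual one): each frame is concatenated with its left and right neighbours, so a 39-dimensional MFCC vector with cw_left=5 and cw_right=5 becomes a 429-dimensional network input.

import numpy as np

def apply_context_window(feats, cw_left, cw_right):
    """feats: (n_frames, feat_dim) matrix; returns (n_frames, feat_dim*(cw_left+cw_right+1))."""
    n_frames = feats.shape[0]
    # replicate the border frames so that every frame has a full left/right context
    padded = np.concatenate([np.repeat(feats[:1], cw_left, axis=0),
                             feats,
                             np.repeat(feats[-1:], cw_right, axis=0)], axis=0)
    windows = [padded[i:i + n_frames] for i in range(cw_left + cw_right + 1)]
    return np.concatenate(windows, axis=1)

mfcc = np.random.randn(200, 39)        # e.g., 13 MFCCs + deltas + delta-deltas
inp = apply_context_window(mfcc, 5, 5)
print(inp.shape)                       # (200, 429)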
Similarly, the lab field contains some sub-fields. For instance, lab_name refers to the name given to the label, while lab_folder contains the folder where the alignments generated by the Kaldi recipe are stored. lab_opts allows the user to specify some options on the considered alignments. For example, lab_opts="ali-to-pdf" extracts standard context-dependent phone-state labels, while lab_opts=ali-to-phones --per-frame=true can be used to extract monophone targets. lab_count_file is used to specify the file that contains the counts of the considered phone states. These counts are important in the forward phase, where the posterior probabilities computed by the neural network are divided by their priors. PyTorch-Kaldi allows users to either specify an external count file or to retrieve it automatically (using lab_count_file=auto). Users can also specify lab_count_file=none if the count file is not strictly needed, e.g., when the labels correspond to an output not used to generate the posterior probabilities employed in the forward phase (see for instance the monophone targets in cfg/TIMIT_baselines/TIMIT_MLP_mfcc.cfg). lab_data_folder, instead, corresponds to the data folder created during the Kaldi data preparation. It contains several files, including the text file eventually used to compute the final WER. The last sub-field, lab_graph, is the path of the Kaldi graph used to generate the labels.
The full dataset is usually large and cannot fit the GPU/RAM memory. It should thus be split into several chunks. PyTorch-Kaldi automatically splits the dataset into the number of chunks specified in n_chunks. The number of chunks might depend on the specific dataset. In general, we suggest processing speech chunks of about 1 or 2 hours (depending on the available memory).
[data_use]
train_with = TIMIT_tr
valid_with = TIMIT_dev
forward_with = TIMIT_test
This section tells how the data listed in the [dataset*] sections are used within the run_exp.py script. The first line means that we perform the training with the data called TIMIT_tr. Note that this dataset name must appear in one of the dataset sections, otherwise the config parser will raise an error. Similarly, the second and third lines specify the data used for the validation and forward phases, respectively.
[batches]
batch_size_train = 128
max_seq_length_train = 1000
increase_seq_length_train = False
start_seq_len_train = 100
multply_factor_seq_len_train = 2
batch_size_valid = 128
max_seq_length_valid = 1000
batch_size_train defines the number of training examples in each mini-batch. The field max_seq_length_train truncates the sentences longer than the specified value. When training recurrent models on very long sentences, out-of-memory issues might arise. With this option, we allow users to mitigate such memory problems by truncating long sentences. Moreover, by setting increase_seq_length_train=True it is possible to progressively grow the maximum sentence length during training. If enabled, the training starts with the maximum sentence length specified in start_seq_len_train (e.g., start_seq_len_train=100). After each epoch the maximum sentence length is multiplied by multply_factor_seq_len_train (e.g., multply_factor_seq_len_train=2). We have observed that this simple strategy generally improves the system performance, since it encourages the model to first focus on short-term dependencies and to learn longer-term ones only at a later stage.
Similarly, batch_size_valid and max_seq_length_valid specify the number of examples per mini-batch and the maximum sentence length for the dev dataset.
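As a rough sketch of the progressive length schedule described above (the values are the ones from this config file; the capping at max_seq_length_train is our assumption):

start_seq_len_train = 100
multply_factor_seq_len_train = 2
max_seq_length_train = 1000

max_len = start_seq_len_train
for epoch in range(8):
    # sentences longer than max_len are truncated during this epoch
    print("epoch %d: truncate sentences longer than %d frames" % (epoch, max_len))
    # after each epoch, the maximum length is multiplied by the given factor
    max_len = min(max_len * multply_factor_seq_len_train, max_seq_length_train)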
[architecture1]
arch_name = MLP_layers1
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,1024,1024,1024,N_out_lab_cd
dnn_drop = 0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,True,True,True,False
dnn_use_laynorm = False,False,False,False,False
dnn_act = relu,relu,relu,relu,softmax
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False
The sections [architecture*] are used to specify the architectures of the neural networks involved in the ASR experiment. The field arch_name specifies the name of the architecture. Since different neural networks can depend on different sets of hyperparameters, the user has to add the path of a proto file containing the list of hyperparameters in the field arch_proto. For example, the prototype file of a standard MLP model contains the following fields:
[proto]
library=path
class=MLP
dnn_lay=str_list
dnn_drop=float_list(0.0,1.0)
dnn_use_laynorm_inp=bool
dnn_use_batchnorm_inp=bool
dnn_use_batchnorm=bool_list
dnn_use_laynorm=bool_list
dnn_act=str_list
Similarly to the other proto files, each line defines a hyperparameter with the related value type. All the hyperparameters defined in the proto file must appear in the global configuration file under the corresponding [architecture*] section. The field arch_library specifies the file where the model is coded (e.g., neural_networks.py), while arch_class indicates the name of the class implementing the architecture (e.g., if we set class=MLP we will do from neural_networks import MLP).
The field arch_pretrain_file can be used to pre-train the neural network with a previously trained architecture, while arch_freeze should be set to False if you want to train the parameters of the architecture and to True if you want to keep the parameters fixed (i.e., frozen) during training. The field arch_seq_model indicates whether the architecture is sequential (e.g., an RNN) or non-sequential (e.g., a feed-forward MLP or CNN). The way PyTorch-Kaldi processes the input batches is different in the two cases. For recurrent neural networks (arch_seq_model=True) the sequence of features is not randomized (to preserve the elements of the sequences), while for feed-forward models (arch_seq_model=False) we randomize the features (this usually helps to improve the performance). In the case of multiple architectures, sequential processing is used if at least one of the employed architectures is marked as sequential (arch_seq_model=True).
Note that the hyperparameters starting with "arch_" and "opt_" are mandatory and must be present in all the architectures specified in the config file. The other hyperparameters (e.g., dnn_*) are specific to the considered architecture (they depend on how the user has actually implemented the class MLP) and can define the number and type of hidden layers, batch and layer normalizations, and other parameters. Other important parameters are related to the optimization of the considered architecture. For instance, arch_lr is the learning rate, while arch_halving_factor is used to implement learning rate annealing. In particular, when the relative performance improvement on the dev set between two consecutive epochs is smaller than the value specified in arch_improvement_threshold (e.g., arch_improvement_threshold=0.001), we multiply the learning rate by arch_halving_factor (e.g., arch_halving_factor=0.5). The field arch_opt specifies the type of optimization algorithm. We currently support SGD, Adam, and RMSprop. The other parameters are specific to the considered optimization algorithm (see the PyTorch documentation for the exact meaning of all the optimization-specific hyperparameters). Note that the different architectures defined in [architecture*] can have different optimization hyperparameters and can even use different optimization algorithms.
[model]
model_proto = proto/model.proto
model = out_dnn1=compute(MLP_layers1,mfcc)
loss_final=cost_nll(out_dnn1,lab_cd)
err_final=cost_err(out_dnn1,lab_cd)
The way all the various features and architectures are combined is specified in this section with a very simple and intuitive meta-language. The field model: describes how features and architectures are connected to generate, as output, a set of posterior probabilities. The line out_dnn1=compute(MLP_layers1,mfcc) means "feed the architecture called MLP_layers1 with the features called mfcc and store the output into the variable out_dnn1". From the neural network output out_dnn1, the error and the loss functions are computed using the labels called lab_cd, which must be previously defined in the [dataset*] sections. The err_final and loss_final fields are mandatory sub-fields that define the final output of the model.
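In rough PyTorch terms (just a sketch with hypothetical tensor sizes, not the toolkit's internal implementation, and assuming the architecture outputs log-posteriors), the three lines above correspond to something like:

import torch
import torch.nn as nn
import torch.nn.functional as F

# stand-ins for the architecture and the data defined in the cfg file (sizes are hypothetical)
MLP_layers1 = nn.Sequential(nn.Linear(429, 1024), nn.ReLU(),
                            nn.Linear(1024, 1904), nn.LogSoftmax(dim=1))
mfcc = torch.randn(128, 429)             # a mini-batch of context-windowed MFCC frames
lab_cd = torch.randint(0, 1904, (128,))  # context-dependent phone-state targets

# out_dnn1=compute(MLP_layers1,mfcc): feed the MLP_layers1 architecture with the mfcc features
out_dnn1 = MLP_layers1(mfcc)

# loss_final=cost_nll(out_dnn1,lab_cd): negative log-likelihood loss against the cd labels
loss_final = F.nll_loss(out_dnn1, lab_cd)

# err_final=cost_err(out_dnn1,lab_cd): frame-level classification error rate
err_final = (out_dnn1.argmax(dim=1) != lab_cd).float().mean()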
A more complex example (discussed here just to highlight the potential of the toolkit) is reported in cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg:
[model]
model_proto=proto/model.proto
model:conc1=concatenate(mfcc,fbank)
conc2=concatenate(conc1,fmllr)
out_dnn1=compute(MLP_layers_first,conc2)
out_dnn2=compute(liGRU_layers,out_dnn1)
out_dnn3=compute(MLP_layers_second,out_dnn2)
out_dnn4=compute(MLP_layers_last,out_dnn3)
out_dnn5=compute(MLP_layers_last2,out_dnn3)
loss_mono=cost_nll(out_dnn5,lab_mono)
loss_mono_w=mult_constant(loss_mono,1.0)
loss_cd=cost_nll(out_dnn4,lab_cd)
loss_final=sum(loss_cd,loss_mono_w)
err_final=cost_err(out_dnn4,lab_cd)
In this case, we first concatenate the MFCC, FBANK, and fMLLR features and feed an MLP. The output of the MLP is fed into a recurrent neural network (specifically a Li-GRU model). We then have another MLP layer (MLP_layers_second) followed by two softmax classifiers (i.e., MLP_layers_last, MLP_layers_last2). The first one estimates standard context-dependent states, while the second one estimates monophone targets. The final cost function is a weighted sum between these two predictions. In this way, we implement the monophone regularization, which turned out to be useful to improve the ASR performance.
The full model can be considered as a single big computational graph, where all the basic architectures used in the [model] section are jointly trained. For each mini-batch, the input features are propagated through the full model and loss_final is computed using the specified labels. The gradient of the cost function with respect to all the learnable parameters of the architectures is then computed. All the parameters of the employed architectures are then updated together with the algorithm specified in the [architecture*] sections.
[forward]
forward_out = out_dnn1
normalize_posteriors = True
normalize_with_counts_from = lab_cd
save_out_file = True
require_decoding = True
The forward section first defines which output to forward (it must be defined in the model section). If normalize_posteriors=True, these posterior probabilities are normalized by their priors (using a count file). If save_out_file=True, the posterior file (usually a very big ark file) is stored, while if set to False this file is deleted when no longer needed. require_decoding is a boolean that specifies whether the specified output requires decoding or not. The field normalize_with_counts_from sets which counts are used to normalize the posterior probabilities.
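A minimal sketch of what this normalization does (an illustration; the toolkit reads the counts from the count file and writes the result into the ark file):

import numpy as np

def normalize_posteriors(log_post, counts):
    """log_post: (n_frames, n_pdf) log-posteriors produced by the network;
    counts: (n_pdf,) occupation counts of each pdf estimated on the training alignments."""
    priors = counts / counts.sum()
    # dividing posteriors by priors (subtracting in the log domain) turns them into
    # scaled likelihoods, which is what the HMM decoder expects
    return log_post - np.log(priors)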
[decoding]
decoding_script_folder = kaldi_decoding_scripts/
decoding_script = decode_dnn.sh
decoding_proto = proto/decoding.proto
min_active = 200
max_active = 7000
max_mem = 50000000
beam = 13.0
latbeam = 8.0
acwt = 0.2
max_arcs = -1
skip_scoring = false
scoring_script = local/score.sh
scoring_opts = "--min-lmwt 1 --max-lmwt 10"
norm_vars = False
The decoding section reports the parameters of the decoding step, i.e., the step that allows one to pass from the sequence of context-dependent probabilities provided by the DNN to a sequence of words. The field decoding_script_folder specifies the folder where the decoding script is stored. The decoding_script field is the script used for decoding (e.g., decode_dnn.sh) that should be in the decoding_script_folder specified before. The field decoding_proto reports all the parameters needed by the considered decoding script.
To make the code more flexible, the config parameters can also be specified within the command line. For example, you can run:
python run_exp.py quick_test/example_newcode.cfg --optimization,lr=0.01 --batches,batch_size=4
The script will replace the learning rate in the specified cfg file with the specified lr value. The modified config file is then stored into out_folder/config.cfg .
The script run_exp.py automatically creates chunk-specific config files, which are used by the run_nn function to perform a single-chunk training. The structure of the chunk-specific cfg files is very similar to that of the global one. The main difference is a field to_do={train, valid, forward} that specifies the type of processing to do on the feature chunk specified in the field fea.
Why proto files? Different neural networks, optimization algorithms, and HMM decoders might depend on a different set of hyperparameters. To address this issue, our current solution is based on the definition of some prototype files (for global, chunk, and architecture config files). In general, this approach allows a more transparent check of the fields specified in the global config file. Moreover, it allows users to easily add new parameters without changing any line of the python code. For instance, to add a user-defined model, a new proto file (e.g., user-model.proto) that specifies the hyperparameter list must be written. Then, the user only has to write a class (e.g., user-model in neural_networks.py) that implements the architecture.
The toolkit is designed to allow users to easily plug-in their own acoustic models. To add a customized neural model do the following steps:
[proto]
dnn_lay=str_list
dnn_drop=float_list(0.0,1.0)
dnn_use_laynorm_inp=bool
dnn_use_batchnorm_inp=bool
dnn_use_batchnorm=bool_list
dnn_use_laynorm=bool_list
dnn_act=str_list
The parameter dnn_lay must be a list of strings, dnn_drop (i.e., the dropout factors for each layer) is a list of floats ranging from 0.0 to 1.0, while dnn_use_laynorm_inp and dnn_use_batchnorm_inp are booleans that enable or disable batch or layer normalization of the input. dnn_use_batchnorm and dnn_use_laynorm are lists of booleans that decide, layer by layer, whether batch/layer normalization has to be used. The parameter dnn_act is again a list of strings that sets the activation function of each layer. Since every model is based on its own set of hyperparameters, different models have different prototype files. For instance, you can take a look into GRU.proto and see that the hyperparameter list is different from that of a standard MLP. Similarly to the previous examples, you should add here your list of hyperparameters and save the file.
Write a PyTorch class implementing your model. Open the library neural_networks.py and look at some of the models already implemented. For simplicity, you can start taking a look into the class MLP. The classes have two mandatory methods: init and forward . The first one is used to initialize the architecture, the second specifies the list of computations to do. The method init takes in input two variables that are automatically computed within the run_nn function. inp_dim is simply the dimensionality of the neural network input, while options is a dictionary containing all the parameters specified into the section architecture of the configuration file.
For instance, you can access the dnn_lay hyperparameter (i.e., the sizes of the various layers) in this way: options['dnn_lay'].split(','). As you might see from the MLP class, the initialization method defines and initializes all the parameters of the neural network. The forward method takes in input a tensor x (i.e., the input data) and outputs another tensor containing the result of the computations. If your model is a sequence model (i.e., if there is at least one architecture with arch_seq_model=true in the cfg file), x is a tensor with shape (time_steps, batches, N_in), otherwise it is a (batches, N_in) matrix. The forward method defines the list of computations that transform the input tensor into a corresponding output tensor. The output must have the sequential format (time_steps, batches, N_out) for recurrent models and the non-sequential format (batches, N_out) for feed-forward models. Similarly to the already-implemented models, the user should write a new class (e.g., myDNN) that implements the customized model:
class myDNN(nn.Module):

    def __init__(self, options, inp_dim):
        super(myDNN, self).__init__()
        # initialize the parameters of your model here (e.g., read them from the options dict)

    def forward(self, x):
        # do some computations: out = f(x)
        return out
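For concreteness, here is a minimal hypothetical implementation of such a class (a sketch, not the toolkit's own code); it only assumes that options carries the dnn_lay field from the [architecture*] section, with every entry already resolved to an integer layer size:

import torch
import torch.nn as nn

class myDNN(nn.Module):
    def __init__(self, options, inp_dim):
        super(myDNN, self).__init__()
        # dnn_lay is read from the cfg file as a comma-separated string of layer sizes
        lay_sizes = list(map(int, options['dnn_lay'].split(',')))
        self.layers = nn.ModuleList()
        current_dim = inp_dim
        for size in lay_sizes:
            self.layers.append(nn.Linear(current_dim, size))
            current_dim = size
        # the built-in models also store the output dimensionality (see the MLP class)
        self.out_dim = current_dim

    def forward(self, x):
        # x is (batches, N_in) for feed-forward models, (time_steps, batches, N_in) for sequential ones
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
        # log-probabilities in output, so that cost_nll can be applied to it
        return torch.log_softmax(self.layers[-1](x), dim=-1)

The [architecture1] snippet that follows shows how to point the toolkit to such a class.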
[architecture1]
arch_name= mynetwork (this is a name you would like to use to refer to this architecture within the following model section)
arch_proto=proto/myDNN.proto (here is the name of the proto file defined before)
arch_library=neural_networks (this is the name of the library where myDNN is implemented)
arch_class=myDNN (This must be the name of the class you have implemented)
arch_pretrain_file=none (With this you can specify if you want to pre-train your model)
arch_freeze=False (set False if you want to update the parameters of your model)
arch_seq_model=False (set False for feed-forward models, True for recurrent models)
Then, you have to specify proper values for all the hyperparameters specified in proto/myDNN.proto . For the MLP.proto , we have:
dnn_lay=1024,1024,1024,1024,1024,N_out_lab_cd
dnn_drop=0.15,0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp=False
dnn_use_batchnorm_inp=False
dnn_use_batchnorm=True,True,True,True,True,False
dnn_use_laynorm=False,False,False,False,False,False
dnn_act=relu,relu,relu,relu,relu,softmax
Then, add the following parameters related to the optimization of your own architecture. You can use here standard sgd, adam, or rmsprop (see cfg/TIMIT_baselines/TIMIT_LSTM_mfcc.cfg for an example with rmsprop):
arch_lr=0.08
arch_halving_factor=0.5
arch_improvement_threshold=0.001
arch_opt=sgd
opt_momentum=0.0
opt_weight_decay=0.0
opt_dampening=0.0
opt_nesterov=False
Save the configuration file into the cfg folder (eg, cfg/myDNN_exp.cfg ).
Run the experiment with:
python run_exp.py cfg/myDNN_exp.cfg
When implementing a new model, an important debug test consists of doing an overfitting experiment (to make sure that the model is able to overfit a tiny dataset). If the model is not able to overfit, it means that there is a major bug to solve.
A hyperparameter tuning is often needed in deep learning to search for proper neural architectures. To help tune the hyperparameters within PyTorch-Kaldi, we have implemented a simple utility that performs a random search. In particular, the script tune_hyperparameters.py generates a set of random configuration files and can be run in this way:
python tune_hyperparameters.py cfg/TIMIT_MLP_mfcc.cfg exp/TIMIT_MLP_mfcc_tuning 10 arch_lr=randfloat(0.001,0.01) batch_size_train=randint(32,256) dnn_act=choose_str{relu,relu,relu,relu,softmax|tanh,tanh,tanh,tanh,softmax}
The first parameter is the reference cfg file that we would like to modify, while the second one is the folder where the random configuration files are saved. The third parameter is the number of random config files that we would like to generate. Then there is the list of all the hyperparameters that we want to change. For instance, arch_lr=randfloat(0.001,0.01) will replace the field arch_lr with a random float ranging from 0.001 to 0.01. batch_size_train=randint(32,256) will replace batch_size_train with a random integer between 32 and 256, and so on. Once the config files are created, they can be run sequentially or in parallel with:
python run_exp.py $cfg_file
PyTorch-Kaldi can be used with any speech dataset. To use your own dataset, the steps to take are similar to those discussed in the TIMIT/Librispeech tutorials. In general, what you have to do is the following:
python run_exp.py $cfg_file
The current version of PyTorch-Kaldi supports input features stored with the Kaldi ark format. If the user wants to perform experiments with customized features, the latter must be converted into the ark format. Take a look into the Kaldi-io-for-python git repository (https://github.com/vesis84/kaldi-io-for-python) for a detailed description about converting numpy arrays into ark files. Moreover, you can take a look into our utility called save_raw_fea.py. This script generates Kaldi ark files containing raw features, that are later used to train neural networks fed by the raw waveform directly (see the section about processing audio with SincNet).
The current version of PyTorch-Kaldi supports the standard production process of using a PyTorch-Kaldi pre-trained acoustic model to transcribe one or multiple .wav files. It is important to understand that you must have a trained PyTorch-Kaldi model. While you don't need labels or alignments anymore, PyTorch-Kaldi still needs many files to transcribe a new audio file:
Once you have all these files, you can start adding your dataset section to the global configuration file. The easiest way is to copy the cfg file used to train your acoustic model and just modify by adding a new [dataset] :
[dataset4]
data_name = myWavFile
fea = fea_name=fbank
fea_lst=myWavFilePath/data/feats.scp
fea_opts=apply-cmvn --utt2spk=ark:myWavFilePath/data//utt2spk ark:myWavFilePath/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
cw_left=5
cw_right=5
lab = lab_name=none
lab_data_folder=myWavFilePath/data/
lab_graph=myWavFilePath/exp/tri3/graph
n_chunks=1
[data_use]
train_with = TIMIT_tr
valid_with = TIMIT_dev
forward_with = myWavFile
The key string for your audio file transcription is lab_name=none. The none tag asks PyTorch-Kaldi to enter a production mode that only does the forward propagation and decoding without any labels. You don't need TIMIT_tr and TIMIT_dev to be on your production server, since PyTorch-Kaldi will skip this information and go directly to the forward phase of the dataset given in the forward_with field. As you can see, the global fea field requires exactly the same parameters as a standard training or testing dataset, while the lab field only requires two parameters. Please note that lab_data_folder is nothing more than the same path as fea_lst. Finally, you still need to specify the number of chunks you want to create to process this file (1 hour = 1 chunk).
WARNING
In your standard .cfg, you might have used keywords such as N_out_lab_cd that can not be used anymore. Indeed, in a production scenario, you don't want to have the training data on your machine. Therefore, all the variables that were on your .cfg file must be replaced by their true values. To replace all the N_out_{mono,lab_cd} you can take a look at the output of:
hmm-info /path/to/the/final.mdl/used/to/generate/the/training/ali
Then, if you normalize posteriors as (check in your .cfg Section forward):
normalize_posteriors = True
normalize_with_counts_from = lab_cd
You must replace lab_cd by:
normalize_posteriors = True
normalize_with_counts_from = /path/to/ali_train_pdf.counts
This normalization step is crucial for HMM-DNN speech recognition. DNNs, in fact, provide posterior probabilities, while HMMs are generative models that work with likelihoods. To derive the required likelihoods, one can simply divide the posteriors by the prior probabilities. To create this ali_train_pdf.counts file you can proceed as follows:
alidir=/path/to/the/exp/tri_ali (change it with your path to the exp with the ali)
num_pdf=$(hmm-info $alidir/final.mdl | awk '/pdfs/{print $4}')
labels_tr_pdf="ark:ali-to-pdf $alidir/final.mdl "ark:gunzip -c $alidir/ali.*.gz |" ark:- |"
analyze-counts --verbose=1 --binary=false --counts-dim=$num_pdf "$labels_tr_pdf" ali_train_pdf.counts
Et voilà! In a production scenario, you might need to transcribe a huge number of audio files, and you don't want to create as many .cfg files as there are files to transcribe. To this end, after creating this initial production .cfg file (you can leave the paths blank), you can call the run_exp.py script with specific arguments referring to your different .wav features:
python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_fbank_prod.cfg --dataset4,fea,0,fea_lst="myWavFilePath/data/feats.scp" --dataset4,lab,0,lab_data_folder="myWavFilePath/data/" --dataset4,lab,0,lab_graph="myWavFilePath/exp/tri3/graph/"
This command will internally alter the configuration file with your specified paths and run with your defined features! Note that passing long arguments to the run_exp.py script requires a specific notation. --dataset4 specifies the name of the created section, fea is the name of the higher-level field, while fea_lst or lab_graph are the names of the lowest-level fields you want to change. The 0 indicates which occurrence of the lowest-level field you want to alter; indeed, some configuration files may contain multiple lab_graph per dataset! Therefore, 0 indicates the first occurrence, 1 the second, and so on. Paths MUST be encapsulated by " " to be interpreted as full strings! Note that you need to alter the data_name and forward_with fields if you don't want different .wav file transcriptions to erase each other (decoding files are stored according to the field data_name): --dataset4,data_name=MyNewName --data_use,forward_with=MyNewName.
In order to give users more flexibility, the latest version of PyTorch-Kaldi supports scheduling of the batch size, max_seq_length_train, learning rate, and dropout factor. This means that it is now possible to change these values during training. To support this feature, we implemented the following formalisms within the config files:
batch_size_train = 128*12 | 64*10 | 32*2
In this case, our batch size will be 128 for the first 12 epochs, 64 for the following 10 epochs, and 32 for the last two epochs. By default "*" means "for N times", while "|" is used to indicate a change of the batch size. Note that if the user simply sets batch_size_train = 128 , the batch size is kept fixed during all the training epochs by default.
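A small sketch of how such a schedule string can be expanded into a per-epoch list (an illustration, not the toolkit's parser):

def expand_schedule(spec):
    """Expand e.g. '128*12 | 64*10 | 32*2' into one value per epoch."""
    per_epoch = []
    for block in spec.split('|'):
        value, n_epochs = block.strip().split('*')
        per_epoch.extend([value] * int(n_epochs))
    return per_epoch

schedule = expand_schedule("128*12 | 64*10 | 32*2")
print(len(schedule), schedule[0], schedule[12], schedule[-1])   # 24 128 64 32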
A similar formalism can be used to perform learning rate scheduling:
arch_lr = 0.08*10|0.04*5|0.02*3|0.01*2|0.005*2|0.0025*2
In this case, if the user simply sets arch_lr = 0.08 the learning rate is annealed with the new-bob procedure used in the previous version of the toolkit. In practice, we start from the specified learning rate and we multiply it by a halving factor every time that the improvement on the validation dataset is smaller than the threshold specified in the field arch_improvement_threshold .
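A sketch of this new-bob style annealing (with hypothetical validation errors; the actual logic lives in run_exp.py):

arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001

lr = arch_lr
prev_valid_err = None
for epoch, valid_err in enumerate([0.591, 0.541, 0.516, 0.504, 0.494, 0.4939]):
    if prev_valid_err is not None:
        rel_improvement = (prev_valid_err - valid_err) / prev_valid_err
        # halve the learning rate when the relative improvement on the dev set is too small
        if rel_improvement < arch_improvement_threshold:
            lr *= arch_halving_factor
    prev_valid_err = valid_err
    print("epoch %d: lr=%.6f" % (epoch, lr))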
Also the dropout factor can now be changed during training with the following formalism:
dnn_drop = 0.15*12|0.20*12,0.15,0.15*10|0.20*14,0.15,0.0
With the line before we can set a different dropout rate for different layers and for different epochs. For instance, the first hidden layer will have a dropout rate of 0.15 for the first 12 epochs, and 0.20 for the other 12. The dropout factor of the second layer, instead, will remain constant to 0.15 over all the training. The same formalism is used for all the layers. Note that "|" indicates a change in the dropout factor within the same layer, while "," indicates a different layer.
You can take a look at a config file where batch sizes, learning rates, and dropout factors are changed here:
cfg/TIMIT_baselines/TIMIT_mfcc_basic_flex.cfg
or here:
cfg/TIMIT_baselines/TIMIT_liGRU_fmllr_lr_schedule.cfg
The project is still in its initial phase and we invite all potential contributors to participate. We hope to build a community of developers large enough to progressively maintain, improve, and expand the functionalities of our current toolkit. For instance, it would be helpful to report any bug or any suggestion to improve the current version of the code. People can also contribute by adding additional neural models, which can eventually enrich the set of currently-implemented architectures.
Take a look at our video introducing SincNet
SincNet is a convolutional neural network recently proposed to process raw audio waveforms. In particular, SincNet encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end, that only depends on some parameters with a clear physical meaning.
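To give a feel for this inductive bias, the following sketch (simplified, ignoring the windowing and normalization details of the actual SincNet implementation, and using torch.sinc, which requires a recent PyTorch) builds one band-pass filter from its two learnable cutoff frequencies as the difference of two low-pass sinc filters:

import torch

def bandpass_sinc_filter(f_low, f_high, kernel_size=251, fs=16000):
    """f_low, f_high: cutoff frequencies in Hz (learnable scalars);
    returns a (kernel_size,) time-domain band-pass filter."""
    n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1, dtype=torch.float32)
    t = n / fs
    # ideal low-pass filter with cutoff f: 2*f*sinc(2*f*t); the band-pass is the difference
    band_pass = 2 * f_high * torch.sinc(2 * f_high * t) - 2 * f_low * torch.sinc(2 * f_low * t)
    return band_pass / band_pass.abs().max()

# only the two cutoffs are learned; the 251 filter taps are derived from them
f1 = torch.tensor(300.0, requires_grad=True)
f2 = torch.tensor(3000.0, requires_grad=True)
g = bandpass_sinc_filter(f1, f2)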
For a more detailed description of the SincNet model, please refer to the following papers:
M. Ravanelli, Y. Bengio, "Speaker Recognition from raw waveform with SincNet", in Proc. of SLT 2018 ArXiv
M. Ravanelli, Y.Bengio, "Interpretable Convolutional Filters with SincNet", in Proc. of NIPS@IRASL 2018 ArXiv
To use this model for speech recognition on TIMIT, do the following steps:
python ./run_exp.py cfg/TIMIT_baselines/TIMIT_SincNet_raw.cfg
In the following table, we compare the results of SincNet with those of other feed-forward neural networks:
| Model | WER(%) |
|---|---|
| MLP -fbank | 18.7 |
| MLP -mfcc | 18.2 |
| CNN -raw | 18.1 |
| SincNet -raw | 17.2 |
In this section, we show how to use PyTorch-Kaldi to jointly train a cascade between a speech enhancement and a speech recognition neural networks. The speech enhancement has the goal of improving the quality of the speech signal by minimizing the MSE between clean and noisy features. The enhanced features then feed another neural network that predicts context-dependent phone states.
In the following, we report a toy-task example based on a reverberated version of TIMIT, that is only intended to show how users should set the config file to train such a combination of neural networks. Even though some implementation details (and the adopted datasets) are different, this tutorial is inspired by this paper:
To run the system do the following steps:
1- Make sure you have the standard clean version of TIMIT available.
2- Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute the clean features (that will be the labels of the speech enhancement system) and the alignments (that will be the labels of the speech recognition system). We recommend running the full timit s5 recipe (including the DNN training).
3- The standard TIMIT recipe uses MFCCs features. In this tutorial, instead, we use FBANK features. To compute FBANK features run the following script in $KALDI_ROOT/egs/TIMIT/s5 :
feadir=fbank
for x in train dev test; do
steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done
Note that we use 40 FBANKS here, while Kaldi uses by default 23 FBANKs. To compute 40-dimensional features go into "$KALDI_ROOT/egs/TIMIT/conf/fbank.conf" and change the number of considered output filters.
4- Go to this external repository and follow the steps to generate a reverberated version of TIMIT starting from the clean one. Note that this is just a toy task that is only helpful to show how to set up a joint-training system.
5- Compute the FBANK features for the TIMIT_rev dataset. To do it, you can copy the scripts in $KALDI_ROOT/egs/TIMIT/ into $KALDI_ROOT/egs/TIMIT_rev/. Please, copy the data folder as well. Note that the audio files in the TIMIT_rev folders are saved with the standard WAV format, while TIMIT is released with the SPHERE format. To bypass this issue, open the files data/train/wav.scp, data/dev/wav.scp, data/test/wav.scp and delete the part about SPHERE reading (e.g., /home/mirco/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav). You also have to change the paths from the standard TIMIT to the reverberated one (e.g., replace /TIMIT/ with /TIMIT_rev/). Remember to remove the final pipeline symbol "|". Save the changes and run the computation of the fbank features in this way:
feadir=fbank
for x in train dev test; do
steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done
Remember to change the $KALDI_ROOT/egs/TIMIT_rev/conf/fbank.conf file in order to compute 40 features rather than the 23 FBANKS of the default configuration.
6- Once features are computed, open the following config file:
cfg/TIMIT_baselines/TIMIT_rev/TIMIT_joint_training_liGRU_fbank.cfg
Remember to change the paths according to where data are stored in your machine. As you can see, we consider two types of features. The fbank_rev features are computed from the TIMIT_rev dataset, while the fbank_clean features are derived from the standard TIMIT dataset and are used as targets for the speech enhancement neural network. As you can see in the [model] section of the config file, we have the cascade between networks doing speech enhancement and speech recognition. The speech recognition architecture jointly estimates both context-dependent and monophone targets (thus using the so-called monophone regularization). To run an experiment type the following command:
python run_exp.py cfg/TIMIT_baselines/TIMIT_rev/TIMIT_joint_training_liGRU_fbank.cfg
7- Results: With this configuration file, you should obtain a Phone Error Rate (PER)=28.1%. Note that some oscillations around this performance are more than natural and are due to different initializations of the neural parameters.
You can take a closer look into our results here
In this tutorial, we use the DIRHA-English dataset to perform a distant speech recognition experiment. The DIRHA English Dataset is a multi-microphone speech corpus being developed under the EC project DIRHA. The corpus is composed of both real and simulated sequences recorded with 32 sample-synchronized microphones in a domestic environment. The database contains signals of different characteristics in terms of noise and reverberation making it suitable for various multi-microphone signal processing and distant speech recognition tasks. The part of the dataset currently released is composed of 6 native US speakers (3 Males, 3 Females) uttering 409 wall-street journal sentences. The training data have been created using a realistic data contamination approach, that is based on contaminating the clean speech wsj-5k sentences with high-quality multi-microphone impulse responses measured in the targeted environment. For more details on this dataset, please refer to the following papers:
M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments", in Proceedings of ASRU 2015. ArXiv
M. Ravanelli, P. Svaizer, M. Omologo, "Realistic Multi-Microphone Data Simulation for Distant Speech Recognition", in Proceedings of Interspeech 2016. ArXiv
In this tutorial, we use the aforementioned simulated data for training (using LA6 microphone), while test is performed using the real recordings (LA6). This task is very realistic, but also very challenging. The speech signals are characterized by a reverberation time of about 0.7 seconds. Non-stationary domestic noises (such as vacuum cleaner, steps, phone rings, etc.) are also present in the real recordings.
Let's start now with the practical tutorial.
1- If not available, download the DIRHA dataset from the LDC website. LDC releases the full dataset for a small fee.
2- Go to this external repository. As reported in this repository, you have to generate the contaminated WSJ dataset with the provided MATLAB script. Then, you can run the proposed Kaldi baseline to have features and labels ready for our PyTorch-Kaldi toolkit.
3- Open the following configuration file:
cfg/DIRHA_baselines/DIRHA_liGRU_fmllr.cfg
The latter configuration file implements a simple RNN model based on a Light Gated Recurrent Unit (Li-GRU). We used fMLLR as input features. Change the paths and run the following command:
python run_exp.py cfg/DIRHA_baselines/DIRHA_liGRU_fmllr.cfg
4- Results: The aforementioned system should provide Word Error Rate (WER%)=23.2% . You can find the results obtained by us here.
Using the other configuration files in the cfg/DIRHA_baselines folder you can perform experiments with different setups. With the provided configuration files you can obtain the following results:
| Model | WER(%) |
|---|---|
| MLP | 26.1 |
| GRU | 25.3 |
| Li-GRU | 23.8 |
The current version of the repository is mainly designed for speech recognition experiments. We are actively working on a new version, which is much more flexible and can manage inputs/outputs different from Kaldi features/labels. Even with the current version, however, it is possible to implement other systems, such as an autoencoder.
An autoencoder is a neural network whose inputs and outputs are the same. The middle layer normally contains a bottleneck that forces our representations to compress the information of the input. In this tutorial, we provide a toy example based on the TIMIT dataset. For instance, see the following configuration file:
cfg/TIMIT_baselines/TIMIT_MLP_fbank_autoencoder.cfg
Our inputs are the standard 40-dimensional fbank coefficients, gathered using a context window of 11 frames (i.e., the total dimensionality of our input is 440). A feed-forward neural network (called MLP_encoder) encodes our features into a 100-dimensional representation. The decoder (called MLP_decoder) is fed by the learned representation and tries to reconstruct the input. The system is trained with the Mean Squared Error (MSE) metric. Note that in the [Model] section we added the line "err_final=cost_err(dec_out,lab_cd)" at the end. The current version of the toolkit, in fact, requires by default that at least one label is specified (we will remove this limitation in the next version).
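In plain PyTorch terms, the encoder/decoder pair defined in that config file behaves roughly like the following sketch (the 440 and 100 dimensions come from the description above; hidden sizes and everything else are hypothetical):

import torch
import torch.nn as nn
import torch.nn.functional as F

# 40 fbanks x 11-frame context window = 440-dimensional input
MLP_encoder = nn.Sequential(nn.Linear(440, 1024), nn.ReLU(), nn.Linear(1024, 100))
MLP_decoder = nn.Sequential(nn.Linear(100, 1024), nn.ReLU(), nn.Linear(1024, 440))

fbank = torch.randn(128, 440)             # a mini-batch of context-windowed fbank frames
enc_out = MLP_encoder(fbank)              # 100-dimensional bottleneck representation
dec_out = MLP_decoder(enc_out)            # reconstruction of the input features
loss_final = F.mse_loss(dec_out, fbank)   # the system is trained with the MSE metric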
You can train the system running the following command:
python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_fbank_autoencoder.cfg
The results should look like this:
ep=000 tr=['TIMIT_tr'] loss=0.139 err=0.999 valid=TIMIT_dev loss=0.076 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=41
ep=001 tr=['TIMIT_tr'] loss=0.098 err=0.999 valid=TIMIT_dev loss=0.062 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=39
ep=002 tr=['TIMIT_tr'] loss=0.091 err=0.999 valid=TIMIT_dev loss=0.058 err=1.000 lr_architecture1=0.040000 lr_architecture2=0.040000 time(s)=39
ep=003 tr=['TIMIT_tr'] loss=0.088 err=0.999 valid=TIMIT_dev loss=0.056 err=1.000 lr_architecture1=0.020000 lr_architecture2=0.020000 time(s)=38
ep=004 tr=['TIMIT_tr'] loss=0.087 err=0.999 valid=TIMIT_dev loss=0.055 err=0.999 lr_architecture1=0.010000 lr_architecture2=0.010000 time(s)=39
ep=005 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.005000 lr_architecture2=0.005000 time(s)=39
ep=006 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.002500 lr_architecture2=0.002500 time(s)=39
ep=007 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.001250 lr_architecture2=0.001250 time(s)=39
ep=008 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000625 lr_architecture2=0.000625 time(s)=41
ep=009 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000313 lr_architecture2=0.000313 time(s)=38
You should only consider the field "loss=". The field "err=" does not contain useful information in this case (for the aforementioned reason). You can take a look at the generated features by typing the following command:
copy-feats ark:exp/TIMIT_MLP_fbank_autoencoder/exp_files/forward_TIMIT_test_ep009_ck00_enc_out.ark ark,t:- | more
[1] M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", ArXiv
[2] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Improving speech recognition by revising gated recurrent units", in Proceedings of Interspeech 2017. ArXiv
[3] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Light Gated Recurrent Units for Speech Recognition", in IEEE Transactions on Emerging Topics in Computational Intelligence. ArXiv
[4] M. Ravanelli, "Deep Learning for Distant Speech Recognition", PhD Thesis, Unitn 2017. ArXiv
[5] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, Y. Bengio, "Quaternion Recurrent Neural Networks", in Proceedings of ICLR 2019 ArXiv
[6] T. Parcollet, M. Morchid, G. Linarès, R. De Mori, "Bidirectional Quaternion Long-Short Term Memory Recurrent Neural Networks for Speech Recognition", in Proceedings of ICASSP 2019 ArXiv