The PyTorch-Kaldi Speech Recognition Toolkit

PyTorch-Kaldi is an open-source repository for developing state-of-the-art DNN/HMM speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

This repository contains the latest version of the PyTorch-Kaldi toolkit (PyTorch-Kaldi-v1.0). To take a look at the previous version (PyTorch-Kaldi-v0.1), click here.

If you use this code or part of it, please cite the following paper:

M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", arXiv

 @inproceedings{pytorch-kaldi,
title    = {The PyTorch-Kaldi Speech Recognition Toolkit},
author    = {M. Ravanelli and T. Parcollet and Y. Bengio},
booktitle    = {In Proc. of ICASSP},
year    = {2019}
}

The toolkit is released under a Creative Commons Attribution 4.0 International license. You can copy, distribute, and modify the code for research, commercial, and non-commercial purposes. We only ask that you cite the paper referenced above.

To improve the transparency and replicability of speech recognition results, we give users the possibility to release their PyTorch-Kaldi model within this repository. Feel free to contact us (or open a pull request) for that. Moreover, if your paper uses PyTorch-Kaldi, it is also possible to advertise it in this repository.

Watch a short introductory video on the PyTorch-Kaldi toolkit

SpeechBrain

We are happy to announce that the SpeechBrain project (https://speechbrain.github.io/) is now public! We strongly encourage users to migrate to SpeechBrain. It is a much better project that already supports several speech processing tasks, such as speech recognition, speaker recognition, SLU, speech enhancement, speech separation, multi-microphone signal processing, and many others.

The goal is to develop a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised learning, and many others.

The project will be led by Mila and is sponsored by Samsung, Nvidia, and Dolby. SpeechBrain will also benefit from the collaboration and expertise of other companies such as Facebook/PyTorch, IBM Research, and FluentAI.

We are actively looking for collaborators. Feel free to contact us at [email protected] if you are interested in collaborating.

Thanks to our sponsors, we are also able to hire interns working at Mila on the SpeechBrain project. The ideal candidate is a PhD student with experience in PyTorch and speech technologies (send your CV to [email protected]).

The development of SpeechBrain will require some months before a working repository is available. Meanwhile, we will continue to provide support for the PyTorch-Kaldi project.

Stay tuned!

Table of Contents

Introduction
Prerequisites
How to install
Recent updates
Tutorials:
- TIMIT tutorial
- Librispeech tutorial
Toolkit overview:
- Toolkit architecture
- Configuration files
FAQs:
- How can I plug in my model?
- How can I tune the hyperparameters?
- How can I use my own dataset?
- How can I plug in my own features?
- How can I transcribe my own audio files?
- Batch size, learning rate, and dropout scheduler
- How can I contribute to the project?
Extra:
- Speech recognition from the raw waveform with SincNet
- Joint training between speech enhancement and ASR
- Distant speech recognition with DIRHA
- Training an autoencoder
References

Introduction

The PyTorch-Kaldi project aims to bridge the gap between the Kaldi and PyTorch toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits: it also embeds several useful features for developing modern speech recognizers. For instance, the code is specifically designed so that user-defined acoustic models can be plugged in naturally. As an alternative, users can exploit several pre-implemented neural networks that can be customized using intuitive configuration files. PyTorch-Kaldi supports multiple feature and label streams as well as combinations of neural networks, enabling the use of complex neural architectures. The toolkit is publicly released along with rich documentation and is designed to work properly locally or on HPC clusters.

Some features of the new version of the PyTorch-Kaldi toolkit:

- Easy interface with Kaldi.
- Easy plug-in of user-defined models.
- Several pre-implemented models (MLP, CNN, RNN, LSTM, GRU, Li-GRU, SincNet).
- Natural implementation of complex models based on multiple features, labels, and neural architectures.
- Easy and flexible configuration files.
- Automatic recovery from the last processed chunk.
- Automatic chunking and context expansion of the input features.
- Multi-GPU training.
- Designed to work locally or on HPC clusters.
- Tutorials on the TIMIT and Librispeech datasets.

Prerequisites

If not already done, install Kaldi (http://kaldi-asr.org/). As suggested during the installation, do not forget to add the path of the Kaldi binaries to $HOME/.bashrc. For instance, make sure that .bashrc contains the following paths:

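# Root of your local Kaldi checkout (change this to your own path)
export KALDI_ROOT=/home/mirco/kaldi-trunk
# Make the Kaldi binaries visible to the shell
PATH=$PATH:$KALDI_ROOT/tools/openfst
PATH=$PATH:$KALDI_ROOT/src/featbin
PATH=$PATH:$KALDI_ROOT/src/gmmbin
PATH=$PATH:$KALDI_ROOT/src/bin
PATH=$PATH:$KALDI_ROOT/src/nnetbin
export PATH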

Remember to change the KALDI_ROOT variable to your own path. As a first test to check the installation, open a bash shell, type "copy-feats" or "hmm-info", and make sure no errors appear.

If not already done, install PyTorch (http://pytorch.org/). We tested our code on PyTorch 1.0 and PyTorch 0.4. An older version of PyTorch is likely to raise errors. To check your installation, type "python" and, once in the console, type "import torch", and make sure no errors appear.
We recommend running the code on a GPU machine. Make sure that the CUDA libraries (https://developer.nvidia.com/cuda-downloads) are installed and work correctly. We tested our system on CUDA 9.0, 9.1, and 8.0. Make sure that Python is installed (the code is tested with Python 2.7 and Python 3.7). Even though it is not mandatory, we suggest using Anaconda (https://anaconda.org/anaconda/python).
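
The PATH check described above can also be scripted. The following is a minimal Python 3 sketch, not part of the toolkit: the helper name find_kaldi_tools is hypothetical, and it only verifies that the two binaries mentioned above resolve on the current PATH.

```python
import shutil


def find_kaldi_tools(tools=("copy-feats", "hmm-info")):
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {tool: shutil.which(tool) for tool in tools}


if __name__ == "__main__":
    for tool, path in find_kaldi_tools().items():
        status = path if path else "NOT FOUND (check the PATH lines in ~/.bashrc)"
        print("%s: %s" % (tool, status))
```

If a tool is reported as not found, the PATH entries in .bashrc are the first thing to re-check.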

Recent updates

19 Feb. 2019: Updates:

It is now possible to dynamically change the batch size, learning rate, and dropout factors during training. We thus implemented a scheduler that supports the following formalism within the config files:

 batch_size_train = 128*12 | 64*10 | 32*2

The line above means: run 12 epochs with a batch size of 128, 10 epochs with a batch size of 64, and 2 epochs with a batch size of 32. A similar formalism can be used for learning rate and dropout scheduling. See this section for more information.
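
To make the scheduling formalism concrete, here is a small Python 3 sketch that expands such a string into one value per epoch. It is an illustration only, not the toolkit's own parser: the function name expand_schedule is hypothetical, and both the handling of a plain value (repeated for every epoch) and the check that the repeat counts add up to the number of epochs are assumptions.

```python
def expand_schedule(spec, n_epochs, cast=int):
    """Expand 'value*epochs | value*epochs | ...' into a list with one value per epoch."""
    if "*" not in spec:
        # Assumption: a plain value applies to every epoch.
        return [cast(spec.strip())] * n_epochs
    per_epoch = []
    for part in spec.split("|"):
        value, repeats = part.split("*")
        per_epoch.extend([cast(value.strip())] * int(repeats.strip()))
    if len(per_epoch) != n_epochs:
        raise ValueError("schedule covers %d epochs, expected %d" % (len(per_epoch), n_epochs))
    return per_epoch


batch_sizes = expand_schedule("128*12 | 64*10 | 32*2", n_epochs=24)
print(batch_sizes[0], batch_sizes[12], batch_sizes[-1])  # 128 64 32
```

Passing cast=float gives the same expansion for learning-rate or dropout strings.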

5 Feb. 2019: Updates:

- Our toolkit now supports parallel data loading (i.e., the next chunk is stored in memory while the current chunk is being processed). This allows a significant speed-up.
- When performing monophone regularization, users can now set "dnn_lay = n_lab_out_mono". This way, the number of monophones is automatically inferred by our toolkit.
- We integrated the kaldi-io toolkit from the kaldi-io-for-python project into data_io.py.
- We provided a better hyperparameter setting for SincNet (see this section).
- We released some baselines with the DIRHA dataset (see this section). We also provide some configuration examples for a simple autoencoder (see this section) and for a system that jointly trains a speech enhancement and a speech recognition module (see this section).
- We fixed some minor bugs.

Notes on the next version: in the next version, we plan to further extend the functionalities of our toolkit, supporting more models and features. The goal is to make our toolkit suitable for other speech-related tasks such as end-to-end speech recognition, speaker recognition, keyword spotting, speech separation, speech activity detection, speech enhancement, etc.

How to install

To install PyTorch-Kaldi, follow these steps:

Make sure all the software recommended in the "Prerequisites" section is installed and working correctly.
Clone the PyTorch-Kaldi repository:

 git clone https://github.com/mravanelli/pytorch-kaldi

Go into the project folder and install the required packages with:

 pip install -r requirements.txt

TIMIT tutorial

In the following, we provide a short tutorial of the PyTorch-Kaldi toolkit based on the popular TIMIT dataset.

Make sure you have the TIMIT dataset. If not, it can be downloaded from the LDC website (https://catalog.ldc.upenn.edu/ldc93s1).
Make sure your Kaldi and PyTorch installations are fine. Make sure as well that your Kaldi paths are currently working (you should add the Kaldi paths to .bashrc as reported in the "Prerequisites" section). For instance, type "copy-feats" and "hmm-info" and make sure no errors appear.
Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute the features and labels later used to train the PyTorch neural network. We recommend running the full TIMIT s5 recipe (including the DNN training):

 cd kaldi/egs/timit/s5
./run.sh
./local/nnet/run_dnn.sh

This way, all the necessary files are created and the user can directly compare the results obtained by Kaldi with those achieved with our toolkit.

Compute the alignments (i.e., the phone-state labels) for the test and dev data with the following commands (go into $KALDI_ROOT/egs/timit/s5). If you want to use tri3 alignments, type:

 steps/align_fmllr.sh --nj 4 data/dev data/lang exp/tri3 exp/tri3_ali_dev

steps/align_fmllr.sh --nj 4 data/test data/lang exp/tri3 exp/tri3_ali_test

If you want to use DNN alignments (as suggested), type:

 steps/nnet/align.sh --nj 4 data-fmllr-tri3/train data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali

steps/nnet/align.sh --nj 4 data-fmllr-tri3/dev data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_dev

steps/nnet/align.sh --nj 4 data-fmllr-tri3/test data/lang exp/dnn4_pretrain-dbn_dnn exp/dnn4_pretrain-dbn_dnn_ali_test

We start this tutorial with a very simple MLP network trained on MFCC features. Before launching the experiment, take a look at the configuration file cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg. See the description of the configuration files for a detailed description of all its fields.
Change the config file according to your paths. In particular:

- Set "fea_lst" to the path of your MFCC training list (it should be in $KALDI_ROOT/egs/timit/s5/data/train/feats.scp).
- Add your path (e.g., $KALDI_ROOT/egs/timit/s5/data/train/utt2spk) into "--utt2spk=ark:".
- Add your CMVN transformation, e.g., $KALDI_ROOT/egs/timit/s5/mfcc/cmvn_train.ark.
- Add the folder where the labels are stored (e.g., $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali for training data, and $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_dev for dev data).

To avoid errors, make sure that all the paths in the cfg file exist. Please avoid using paths containing bash variables, since paths are read literally and are not automatically expanded (e.g., use /home/mirco/kaldi-trunk/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali instead of $KALDI_ROOT/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali).
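
Since bash variables are not expanded inside the cfg file, it can help to scan a config for them before launching an experiment. The following Python 3 sketch is only an illustration of that check: the function name find_bash_variables and the regular expression are assumptions, not part of the toolkit.

```python
import re

# Matches $NAME and ${NAME}
BASH_VAR = re.compile(r"\$\{?[A-Za-z_][A-Za-z0-9_]*\}?")


def find_bash_variables(cfg_text):
    """Return (line_number, variable) pairs for every bash-style variable found."""
    hits = []
    for line_number, line in enumerate(cfg_text.splitlines(), start=1):
        for match in BASH_VAR.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


sample = (
    "fea_lst=$KALDI_ROOT/egs/timit/s5/data/train/feats.scp\n"
    "lab_folder=/home/mirco/kaldi-trunk/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali\n"
)
print(find_bash_variables(sample))  # [(1, '$KALDI_ROOT')]
```

Any hit should be replaced by the corresponding absolute path.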

Run the ASR experiment:

 python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg

This script starts a full ASR experiment and performs the training, validation, forward, and decoding steps. A progress bar shows the evolution of all these phases. The script run_exp.py progressively creates the following files in the output directory:

- res.res: a file that summarizes training and validation performance over the various validation epochs.
- log.log: a file that contains possible errors and warnings.
- conf.cfg: a copy of the configuration file.
- model.svg: a picture that shows the considered model and how the various neural networks are connected. This is really useful for debugging models that are more complex than this one (e.g., models based on multiple neural networks).
- The folder exp_files contains several files that summarize the evolution of training and validation over the various epochs. For instance, the *.info files report chunk-specific information (the losses and errors of each processed chunk), the *.cfg files are the chunk-specific configuration files (see the general architecture for more details), while the *.lst files report the list of features used to train each specific chunk.
- At the end of training, a directory called generated_outputs is created, containing plots of the loss and errors over the various training epochs.

Note that you can stop the experiment at any time. If you run the script again, it will automatically restart from the last chunk that was correctly processed. The training could take a couple of hours, depending on the available GPU. Note also that if you would like to change some parameters of the configuration file (e.g., n_chunks=, fea_lst=, batch_size_train=, ...) you must specify a different output folder (out_folder=).

Debug: if you run into some errors, we suggest doing the following checks:

- Take a look at the standard output.
- If it is not helpful, take a look at the log.log file.
- Take a look at the function run_nn in the core.py library. Add some prints in the various parts of the function to isolate the problem and figure out the issue.

At the end of training, the phone error rate (PER%) is appended to the res.res file. To see more details on the decoding results, you can go into "decoding_test" in the output folder and take a look at the various files created. For this specific example, we obtained the following res.res file:

 ep=000 tr=['TIMIT_tr'] loss=3.398 err=0.721 valid=TIMIT_dev loss=2.268 err=0.591 lr_architecture1=0.080000 time(s)=86
ep=001 tr=['TIMIT_tr'] loss=2.137 err=0.570 valid=TIMIT_dev loss=1.990 err=0.541 lr_architecture1=0.080000 time(s)=87
ep=002 tr=['TIMIT_tr'] loss=1.896 err=0.524 valid=TIMIT_dev loss=1.874 err=0.516 lr_architecture1=0.080000 time(s)=87
ep=003 tr=['TIMIT_tr'] loss=1.751 err=0.494 valid=TIMIT_dev loss=1.819 err=0.504 lr_architecture1=0.080000 time(s)=88
ep=004 tr=['TIMIT_tr'] loss=1.645 err=0.472 valid=TIMIT_dev loss=1.775 err=0.494 lr_architecture1=0.080000 time(s)=89
ep=005 tr=['TIMIT_tr'] loss=1.560 err=0.453 valid=TIMIT_dev loss=1.773 err=0.493 lr_architecture1=0.080000 time(s)=88
.........
ep=020 tr=['TIMIT_tr'] loss=0.968 err=0.304 valid=TIMIT_dev loss=1.648 err=0.446 lr_architecture1=0.002500 time(s)=89
ep=021 tr=['TIMIT_tr'] loss=0.965 err=0.304 valid=TIMIT_dev loss=1.649 err=0.446 lr_architecture1=0.002500 time(s)=90
ep=022 tr=['TIMIT_tr'] loss=0.960 err=0.302 valid=TIMIT_dev loss=1.652 err=0.447 lr_architecture1=0.001250 time(s)=88
ep=023 tr=['TIMIT_tr'] loss=0.959 err=0.301 valid=TIMIT_dev loss=1.651 err=0.446 lr_architecture1=0.000625 time(s)=88
%WER 18.1 | 192 7215 | 84.0 11.9 4.2 2.1 18.1 99.5 | -0.583 | /home/mirco/pytorch-kaldi-new/exp/TIMIT_MLP_basic5/decode_TIMIT_test_out_dnn1/score_6/ctm_39phn.filt.sys
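
A res.res file in the format shown above is easy to post-process. The following Python 3 sketch extracts the per-epoch losses and errors and picks the epoch with the lowest validation error. The regular expression is inferred only from the lines printed above, and the names parse_res and EPOCH_LINE are hypothetical.

```python
import re

EPOCH_LINE = re.compile(
    r"ep=(?P<epoch>\d+) .*? loss=(?P<tr_loss>[\d.]+) err=(?P<tr_err>[\d.]+) "
    r"valid=\S+ loss=(?P<va_loss>[\d.]+) err=(?P<va_err>[\d.]+)"
)


def parse_res(text):
    """Return one dict per epoch line; lines that do not match (e.g. %WER) are skipped."""
    rows = []
    for line in text.splitlines():
        match = EPOCH_LINE.match(line.strip())
        if match:
            rows.append({
                "epoch": int(match.group("epoch")),
                "train_loss": float(match.group("tr_loss")),
                "train_err": float(match.group("tr_err")),
                "valid_loss": float(match.group("va_loss")),
                "valid_err": float(match.group("va_err")),
            })
    return rows


sample = """ep=000 tr=['TIMIT_tr'] loss=3.398 err=0.721 valid=TIMIT_dev loss=2.268 err=0.591 lr_architecture1=0.080000 time(s)=86
ep=001 tr=['TIMIT_tr'] loss=2.137 err=0.570 valid=TIMIT_dev loss=1.990 err=0.541 lr_architecture1=0.080000 time(s)=87
.........
"""
rows = parse_res(sample)
best = min(rows, key=lambda row: row["valid_err"])
print(best["epoch"], best["valid_err"])  # 1 0.541
```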

The achieved PER(%) is 18.1%. Note that there could be some variability in the results, due to different initializations on different machines. We believe that averaging the performance obtained with different initialization seeds (i.e., changing the field seed in the config file) is crucial for TIMIT, since the natural performance variability might completely hide the experimental evidence. We noticed a standard deviation of about 0.2% for the TIMIT experiments.

If you want to change the features, you first have to compute them with the Kaldi toolkit. To compute fbank features, open $KALDI_ROOT/egs/timit/s5/run.sh and compute them with the following lines:

 feadir=fbank

for x in train dev test; do
  steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done

Then, change the aforementioned configuration file to point to the new feature list. If you have already run the full TIMIT Kaldi recipe, you can directly find the fMLLR features in $KALDI_ROOT/egs/timit/s5/data-fmllr-tri3. If you feed the neural network with these features, you should expect a substantial performance improvement, due to the adoption of speaker adaptation.

In the TIMIT_baselines folder, we propose several other examples of possible TIMIT baselines. Similarly to the previous example, you can run them by simply typing:

 python run_exp.py $cfg_file

There are some examples with recurrent (TIMIT_RNN*, TIMIT_LSTM*, TIMIT_GRU*, TIMIT_LiGRU*) and CNN architectures (TIMIT_CNN*). We also propose a more advanced model (TIMIT_DNN_liGRU_DNN_mfcc+fbank+fmllr.cfg) where we used a combination of feed-forward and recurrent neural networks fed by a concatenation of MFCC, FBANK, and fMLLR features. Note that the latter configuration file corresponds to the best architecture described in the reference paper. As you can see from the above-mentioned configuration files, we improve the ASR performance by including some tricks such as monophone regularization (i.e., we jointly estimate both context-dependent and context-independent targets). The following table reports the results obtained by running the latter systems (average PER%):

Model	MFCC	FBANK	fMLLR
Kaldi DNN Baseline	-----	------	18.5
MLP	18.2	18.7	16.7
RNN	17.7	17.2	15.9
SRU	-----	16.6	-----
LSTM	15.1	14.3	14.5
GRU	16.0	15.2	14.9
Li-GRU	15.5	14.9	14.2

The results show that, as expected, fMLLR features outperform MFCC and FBANK coefficients, thanks to the speaker adaptation process. Recurrent models significantly outperform the standard MLP, especially when using the LSTM, GRU, and Li-GRU architectures, which effectively address gradient vanishing through multiplicative gates. The best result, PER = 14.2%, is obtained with the Li-GRU model [2,3], which is based on a single gate and thus saves 33% of the computations over a standard GRU.

The best results are actually obtained with a more complex architecture that combines MFCC, FBANK, and fMLLR features (see cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg). To the best of our knowledge, the PER = 13.8% achieved by the latter system is the best published performance on the TIMIT test set.

Simple Recurrent Units (SRU) are an efficient and highly parallelizable recurrent model. Their performance on ASR is worse than that of standard LSTM, GRU, and Li-GRU models, but they are significantly faster. SRU is implemented here and described in the following paper:

T. Lei, Y. Zhang, S. I. Wang, H. Dai, Y. Artzi, "Simple Recurrent Units for Highly Parallelizable Recurrence", Proc. of EMNLP 2018. arXiv

To run experiments with this model, use the config file cfg/TIMIT_baselines/TIMIT_SRU_fbank.cfg. Before doing so, install the model using pip install sru and uncomment "import sru" in neural_networks.py.

You can directly compare your results with ours by going here. In this external repository, you can find all the folders containing the generated files.

Librispeech tutorial

The steps to run PyTorch-Kaldi on the Librispeech dataset are similar to those reported above for TIMIT. The following tutorial is based on the 100h subset, but it can be easily extended to the full dataset (960h).

Run the Kaldi recipe for Librispeech at least until stage 13 (included).
Copy the exp/tri4b/trans.* files into exp/tri4b/decode_tgsmall_train_clean_100/:

 mkdir exp/tri4b/decode_tgsmall_train_clean_100 && cp exp/tri4b/trans.* exp/tri4b/decode_tgsmall_train_clean_100/

Compute the fMLLR features by running the following script:

. ./cmd.sh ## You'll want to change cmd.sh to something that will work on your system.
. ./path.sh ## Source the tools/utils (import the queue.pl)

gmmdir=exp/tri4b

for chunk in train_clean_100 dev_clean test_clean; do
    dir=fmllr/$chunk
    steps/nnet/make_fmllr_feats.sh --nj 10 --cmd "$train_cmd" \
        --transform-dir $gmmdir/decode_tgsmall_$chunk \
            $dir data/$chunk $gmmdir $dir/log $dir/data || exit 1

    compute-cmvn-stats --spk2utt=ark:data/$chunk/spk2utt scp:fmllr/$chunk/feats.scp ark:$dir/data/cmvn_speaker.ark
done

Compute the alignments using:

# alignments on train_clean_100, dev_clean, and test_clean
steps/align_fmllr.sh --nj 30 data/train_clean_100 data/lang exp/tri4b exp/tri4b_ali_clean_100
steps/align_fmllr.sh --nj 10 data/dev_clean data/lang exp/tri4b exp/tri4b_ali_dev_clean_100
steps/align_fmllr.sh --nj 10 data/test_clean data/lang exp/tri4b exp/tri4b_ali_test_clean_100

Run the experiments with the following command:

  python run_exp.py cfg/Librispeech_baselines/libri_MLP_fmllr.cfg

If you would like to use a recurrent model, you can use libri_RNN_fmllr.cfg, libri_LSTM_fmllr.cfg, libri_GRU_fmllr.cfg, or libri_liGRU_fmllr.cfg. The training of recurrent models might take some days (depending on the adopted GPU). The performance obtained with the tgsmall graph is reported in the following table:

Model	WER%
MLP	9.6
LSTM	8.6
GRU	8.6
Li-GRU	8.6

These results are obtained without adding lattice rescoring (i.e., using only the tgsmall graph). You can improve the performance by adding lattice rescoring in this way (run it from the kaldi_decoding_script folder of PyTorch-Kaldi):

data_dir=/data/milatmp1/ravanelm/librispeech/s5/data/
dec_dir=/u/ravanelm/pytorch-Kaldi-new/exp/libri_fmllr/decode_test_clean_out_dnn1/
out_dir=/u/ravanelm/pytorch-kaldi-new/exp/libri_fmllr/

steps/lmrescore_const_arpa.sh  $data_dir/lang_test_{tgsmall,fglarge} \
          $data_dir/test_clean $dec_dir $out_dir/decode_test_clean_fglarge   || exit 1;

The final results obtained using rescoring (fglarge) are reported in the following table:

Model	WER%
MLP	6.5
LSTM	6.4
GRU	6.3
Li-GRU	6.2

You can take a look at the results obtained here.

Overview of the toolkit architecture

The main script to run an ASR experiment is run_exp.py. This Python script performs the training, validation, forward, and decoding steps. Training is performed over several epochs, which progressively process all the training material with the considered neural network. After each training epoch, a validation step is performed to monitor the system performance on held-out data. At the end of training, the forward phase is performed by computing the posterior probabilities of the specified test dataset. The posterior probabilities are normalized by their priors (using a count file) and stored into an ark file. A decoding step is then performed to retrieve the final sequence of words uttered by the speaker in the test sentences.
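
The prior normalization mentioned above can be illustrated with a few lines of Python 3. This is a sketch of the standard hybrid DNN/HMM recipe (posterior divided by prior, computed in the log domain), not the toolkit's actual code: the function name normalize_by_priors is hypothetical, and the sketch assumes every state has a non-zero count and a non-zero posterior.

```python
import math


def normalize_by_priors(posteriors, counts):
    """Return log(posterior) - log(prior) per state, with priors estimated from counts."""
    total = float(sum(counts))
    log_priors = [math.log(count / total) for count in counts]
    return [math.log(p) - log_prior for p, log_prior in zip(posteriors, log_priors)]


counts = [50, 30, 20]          # hypothetical state occupation counts
posteriors = [0.2, 0.3, 0.5]   # hypothetical network output for one frame
scaled = normalize_by_priors(posteriors, counts)
print(["%.3f" % value for value in scaled])
```

States that the network scores above their prior get a positive value, and states scored below their prior get a negative one.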

The run_exp.py script takes as input a global config file (e.g., cfg/TIMIT_MLP_mfcc.cfg) that specifies all the options needed to run a full experiment. The code run_exp.py calls another function, run_nn (see the core.py library), that performs the training, validation, and forward operations on each chunk of data. The function run_nn takes as input a chunk-specific config file (e.g., exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep000_ck00.cfg) that specifies all the parameters needed to run a single-chunk experiment. The run_nn function outputs some info files (e.g., exp/TIMIT_MLP_mfcc/exp_files/train_TIMIT_tr+TIMIT_dev_ep000_ck00.info) that summarize the losses and errors of the processed chunk.

The results are summarized in the res.res files, while errors and warnings are redirected to the log.log file.

Description of the configuration files:

There are two types of config files (global and chunk-specific cfg files). They are both in INI format and are read, processed, and modified with the configparser library of Python. The global file contains several sections, which specify all the main steps of a speech recognition experiment (training, validation, forward, and decoding). The structure of the config file is described in a prototype file (see for instance proto/global.proto) that not only lists all the required sections and fields but also specifies the type of each possible field. For instance, N_ep=int(1,inf) means that the field N_ep (i.e., the number of training epochs) must be an integer ranging from 1 to inf. Similarly, lr=float(0,inf) means that the lr field (i.e., the learning rate) must be a float ranging from 0 to inf. Any attempt to write a config file that does not comply with these specifications will raise an error.
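
The type specifications used in the prototype files can be checked with a few lines of code. The following Python 3 sketch validates a raw string against a spec such as int(1,inf) or float(0,inf). It is an illustration under stated assumptions, not the toolkit's validator: the name check_field is hypothetical, only the int and float specs are handled, and the bounds are treated as inclusive.

```python
import re

SPEC = re.compile(r"^(int|float)\(\s*([^,]+?)\s*,\s*([^)]+?)\s*\)$")


def check_field(value, spec):
    """Parse value according to spec and raise ValueError if it is out of range."""
    match = SPEC.match(spec.strip())
    if match is None:
        raise ValueError("unsupported spec: %r" % spec)
    kind, low, high = match.groups()
    cast = int if kind == "int" else float
    parsed = cast(value)  # raises ValueError for e.g. int("0.5")
    # Assumption: both bounds are inclusive.
    if not float(low) <= parsed <= float(high):
        raise ValueError("%r is outside the range of %s" % (value, spec))
    return parsed


print(check_field("24", "int(1,inf)"))      # 24
print(check_field("0.08", "float(0,inf)"))  # 0.08
```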

Let's now open a config file (e.g., cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg) and describe its main sections:

 [cfg_proto]
cfg_proto = proto/global.proto
cfg_proto_chunk = proto/global_chunk.proto

The current version of the config file first specifies the paths of the global and chunk-specific prototype files in the section [cfg_proto].

 [exp]
cmd = 
run_nn_script = run_nn
out_folder = exp/TIMIT_MLP_basic5
seed = 1234
use_cuda = True
multi_gpu = False
save_gpumem = False
n_epochs_tr = 24

The section [exp] contains some important fields, such as the output folder (out_folder) and the path of the chunk-specific processing script run_nn (by default, this function should be implemented in the core.py library). The field n_epochs_tr indicates the selected number of training epochs. Other options (use_cuda, multi_gpu, and save_gpumem) can be enabled by the user. The field cmd can be used to prepend a command for running the script on an HPC cluster.
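
Because the config files are plain INI, the [exp] section shown above can be read with Python's standard configparser module. The sketch below (Python 3) only illustrates how the fields map to typed values; how run_exp.py itself consumes them is not shown here.

```python
import configparser

cfg_text = """
[exp]
cmd =
run_nn_script = run_nn
out_folder = exp/TIMIT_MLP_basic5
seed = 1234
use_cuda = True
multi_gpu = False
save_gpumem = False
n_epochs_tr = 24
"""

config = configparser.ConfigParser()
config.read_string(cfg_text)
exp = config["exp"]

settings = {
    "cmd": exp.get("cmd"),                      # empty string when nothing is set
    "out_folder": exp.get("out_folder"),
    "seed": exp.getint("seed"),
    "use_cuda": exp.getboolean("use_cuda"),
    "multi_gpu": exp.getboolean("multi_gpu"),
    "n_epochs_tr": exp.getint("n_epochs_tr"),
}
print(settings)
```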

 [dataset1]
data_name = TIMIT_tr
fea = fea_name=mfcc
    fea_lst=quick_test/data/train/feats_mfcc.scp
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/train/utt2spk  ark:quick_test/mfcc/train_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
    cw_left=5
    cw_right=5
    
lab = lab_name=lab_cd
    lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=quick_test/data/train/
    lab_graph=quick_test/graph
    
n_chunks = 5

[dataset2]
data_name = TIMIT_dev
fea = fea_name=mfcc
    fea_lst=quick_test/data/dev/feats_mfcc.scp
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/dev/utt2spk  ark:quick_test/mfcc/dev_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
    cw_left=5
    cw_right=5
    
lab = lab_name=lab_cd
    lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali_dev
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=quick_test/data/dev/
    lab_graph=quick_test/graph
n_chunks = 1

[dataset3]
data_name = TIMIT_test
fea = fea_name=mfcc
    fea_lst=quick_test/data/test/feats_mfcc.scp
    fea_opts=apply-cmvn --utt2spk=ark:quick_test/data/test/utt2spk  ark:quick_test/mfcc/test_cmvn_speaker.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
    cw_left=5
    cw_right=5
    
lab = lab_name=lab_cd
    lab_folder=quick_test/dnn4_pretrain-dbn_dnn_ali_test
    lab_opts=ali-to-pdf
    lab_count_file=auto
    lab_data_folder=quick_test/data/test/
    lab_graph=quick_test/graph
    
n_chunks = 1

The config file contains a number of sections ([dataset1], [dataset2], [dataset3], ...) that describe all the corpora used for the ASR experiment. The fields in the [dataset*] section describe all the features and labels considered in the experiment. The features, for instance, are specified in the field fea:, where fea_name contains the name given to the feature, fea_lst is the list of features (in the scp Kaldi format), fea_opts allows users to specify how to process the features (e.g., applying CMVN or adding the derivatives), while cw_left and cw_right set the characteristics of the context window (i.e., the number of left and right frames to append). Note that the current version of the PyTorch-Kaldi toolkit supports the definition of multiple feature streams. Indeed, as shown in cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg, multiple feature streams (e.g., mfcc, fbank, fmllr) are employed.
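
To illustrate what the context window does, the following Python 3 sketch appends cw_left past frames and cw_right future frames to every frame of an utterance. It is a toy illustration, not the toolkit's implementation: the function name splice_frames is hypothetical, and repeating the first and last frame at the utterance edges is an assumption.

```python
def splice_frames(frames, cw_left, cw_right):
    """Concatenate each frame with cw_left past and cw_right future frames."""
    last = len(frames) - 1
    spliced = []
    for t in range(len(frames)):
        window = []
        for offset in range(-cw_left, cw_right + 1):
            index = min(max(t + offset, 0), last)  # clamp at the utterance edges
            window.extend(frames[index])
        spliced.append(window)
    return spliced


frames = [[1.0], [2.0], [3.0]]  # three 1-dimensional frames
print(splice_frames(frames, cw_left=1, cw_right=1))
# [[1.0, 1.0, 2.0], [1.0, 2.0, 3.0], [2.0, 3.0, 3.0]]
```

With cw_left=5 and cw_right=5, as in the config above, each input vector becomes 11 times larger than a single frame.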

Similarly, the lab section contains some sub-fields. For instance, lab_name refers to the name given to the label, while lab_folder contains the folder where the alignments generated by the Kaldi recipe are stored. lab_opts allows the user to specify some options on the considered alignments. For example, lab_opts="ali-to-pdf" extracts standard context-dependent phone-state labels, while lab_opts=ali-to-phones --per-frame=true can be used to extract monophone targets. lab_count_file is used to specify the file that contains the counts of the considered phone states. These counts are important in the forward phase, where the posterior probabilities computed by the neural network are divided by their priors. PyTorch-Kaldi allows users either to specify an external count file or to retrieve it automatically (using lab_count_file=auto). Users can also specify lab_count_file=none if the count file is not strictly needed, e.g., when the labels correspond to an output that is not used to generate the posterior probabilities used in the forward phase (see for instance the monophone targets in cfg/TIMIT_baselines/TIMIT_MLP_mfcc.cfg). lab_data_folder, instead, corresponds to the data folder created during the Kaldi data preparation. It contains several files, including the text file eventually used to compute the final WER. The last sub-field, lab_graph, is the path of the Kaldi graph used to generate the labels.

The full dataset is usually large and cannot fit into GPU/RAM memory. It should thus be split into several chunks. PyTorch-Kaldi automatically splits the dataset into the number of chunks specified in n_chunks. The number of chunks might depend on the specific dataset. In general, we suggest processing speech chunks of about 1 or 2 hours (depending on the available memory).
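
As a rough picture of what chunking means, the following Python 3 sketch splits a list of utterance IDs into n_chunks contiguous parts of nearly equal size. How the toolkit actually assigns utterances to chunks is not described above, so both the function name split_into_chunks and the contiguous, size-balanced strategy are assumptions.

```python
def split_into_chunks(items, n_chunks):
    """Split a list into n_chunks contiguous parts whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


utterances = ["utt%02d" % i for i in range(12)]  # hypothetical utterance IDs
chunks = split_into_chunks(utterances, n_chunks=5)
print([len(chunk) for chunk in chunks])  # [3, 3, 2, 2, 2]
```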

 [data_use]
train_with = TIMIT_tr
valid_with = TIMIT_dev
forward_with = TIMIT_test

This section specifies how the data listed in the [dataset*] sections are used within the run_exp.py script. The first line means that we perform training with the data called TIMIT_tr. Note that this dataset name must appear in one of the dataset sections, otherwise the config parser will raise an error. Similarly, the second and third lines specify the data used for the validation and forward phases, respectively.

 [batches]
batch_size_train = 128
max_seq_length_train = 1000
increase_seq_length_train = False
start_seq_len_train = 100
multply_factor_seq_len_train = 2
batch_size_valid = 128
max_seq_length_valid = 1000

batch_size_train is used to define the number of training examples in the mini-batch. The field max_seq_length_train truncates the sentences longer than the specified value. When training recurrent models on very long sentences, out-of-memory issues might arise. With this option, we allow users to mitigate such memory problems by truncating long sentences. Moreover, it is possible to progressively grow the maximum sentence length during training by setting increase_seq_length_train=True. If enabled, the training starts with the maximum sentence length specified in start_seq_len_train (e.g., start_seq_len_train=100). After each epoch, the maximum sentence length is multiplied by multply_factor_seq_len_train (e.g., multply_factor_seq_len_train=2). We have observed that this simple strategy generally improves the system performance, since it encourages the model to first focus on short-term dependencies and to learn longer-term ones only at a later stage.

Similarly, batch_size_valid and max_seq_length_valid specify the number of examples in the mini-batch and the maximum length for the dev dataset.
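
The progressive growth of the maximum sentence length can be written down in one line. The Python 3 sketch below uses the values from the [batches] section above. The function name max_seq_length is hypothetical, and capping the result at max_seq_length_train is an assumption about how the two fields interact.

```python
def max_seq_length(epoch, start_len=100, factor=2, max_len=1000):
    """Maximum sentence length at a given epoch (epoch 0 is the first one)."""
    # Assumption: growth stops once max_seq_length_train is reached.
    return min(start_len * factor ** epoch, max_len)


print([max_seq_length(epoch) for epoch in range(6)])
# [100, 200, 400, 800, 1000, 1000]
```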

 [architecture1]
arch_name = MLP_layers1
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = none
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,1024,1024,1024,N_out_lab_cd
dnn_drop = 0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,True,True,True,False
dnn_use_laynorm = False,False,False,False,False
dnn_act = relu,relu,relu,relu,softmax
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False

The sections [architecture*] are used to specify the architectures of the neural networks involved in the ASR experiments. The field arch_name specifies the name of the architecture. Since different neural networks can depend on a different set of hyperparameters, the user has to add the path of the proto file that contains the list of hyperparameters into the field arch_proto. For example, the prototype file for a standard MLP model contains the following fields:

 [proto]
library=path
class=MLP
dnn_lay=str_list
dnn_drop=float_list(0.0,1.0)
dnn_use_laynorm_inp=bool
dnn_use_batchnorm_inp=bool
dnn_use_batchnorm=bool_list
dnn_use_laynorm=bool_list
dnn_act=str_list

Similarly to the other prototype files, each line defines a hyperparameter with the related value type. All the hyperparameters defined in the proto file must appear in the global configuration file under the corresponding [architecture*] section. The field arch_library specifies where the model is coded (e.g., neural_networks.py), while arch_class indicates the name of the class where the architecture is implemented (e.g., if we set class=MLP, we will do from neural_networks import MLP).
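
Loading a class from the arch_library and arch_class fields amounts to a dynamic import. The Python 3 sketch below shows one common way to do it with importlib. The helper name load_class is hypothetical and this is not necessarily how the toolkit performs the import. A standard-library module is used in the demo so that the sketch runs without the toolkit installed.

```python
import importlib


def load_class(arch_library, arch_class):
    """Equivalent of 'from <arch_library> import <arch_class>' with names taken from a config."""
    module = importlib.import_module(arch_library)
    return getattr(module, arch_class)


# Inside the toolkit the pair would be e.g. ("neural_networks", "MLP");
# a standard-library module is used here only to keep the example runnable.
cls = load_class("collections", "OrderedDict")
print(cls.__name__)  # OrderedDict
```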

The field arch_pretrain_file can be used to initialize the neural network with a previously trained architecture, while arch_freeze can be set to False if you want to train the parameters of the architecture during training, and should be set to True if you want to keep the parameters fixed (i.e., frozen) during training. The field arch_seq_model indicates whether the architecture is sequential (e.g., RNNs) or non-sequential (e.g., a feed-forward MLP or CNN). PyTorch-Kaldi processes the input batches differently in the two cases. For recurrent neural networks (arch_seq_model=True), the sequence of features is not randomized (to preserve the order of the elements in the sequences), while for feed-forward models (arch_seq_model=False) we randomize the features (this usually helps to improve the performance). In the case of multiple architectures, sequential processing is used if at least one of the employed architectures is marked as sequential (arch_seq_model=True).

Note that the hyperparameters starting with "arch_" and "opt_" are mandatory and must be present in all the architectures specified in the config file. The other hyperparameters (e.g., dnn_*) are specific to the considered architecture (they depend on how the class MLP is actually implemented by the user) and can define the number and type of hidden layers, batch and layer normalization, and other parameters. Other important parameters are related to the optimization of the considered architecture. For instance, arch_lr is the learning rate, while arch_halving_factor is used to implement learning rate annealing. In particular, when the relative performance improvement on the dev set between two consecutive epochs is smaller than the value specified in arch_improvement_threshold (e.g., arch_improvement_threshold=0.001), we multiply the learning rate by arch_halving_factor (e.g., arch_halving_factor=0.5). The field arch_opt specifies the type of optimization algorithm. We currently support SGD, Adam, and RMSprop. The other parameters are specific to the considered optimization algorithm (see the PyTorch documentation for the exact meaning of all the optimization-specific hyperparameters). Note that the different architectures defined in [architecture*] can have different optimization hyperparameters and can even use a different optimization algorithm.
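
The learning rate annealing rule described above can be sketched as follows (Python 3). The function name anneal_lr is hypothetical, and measuring the relative improvement against the previous epoch's dev error is an assumption; the toolkit's exact formula may differ.

```python
def anneal_lr(lr, prev_err, curr_err, threshold=0.001, halving_factor=0.5):
    """Multiply lr by halving_factor when the relative dev improvement is below threshold."""
    # Assumption: the improvement is measured relative to the previous epoch's error.
    relative_improvement = (prev_err - curr_err) / prev_err
    if relative_improvement < threshold:
        return lr * halving_factor
    return lr


print(anneal_lr(0.08, prev_err=0.591, curr_err=0.541))  # 0.08 (clear improvement)
print(anneal_lr(0.08, prev_err=0.446, curr_err=0.446))  # 0.04 (no improvement)
```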

 [model]
model_proto = proto/model.proto
model = out_dnn1=compute(MLP_layers1,mfcc)
    loss_final=cost_nll(out_dnn1,lab_cd)
    err_final=cost_err(out_dnn1,lab_cd)

The way all the features and architectures are combined is specified in this section with a very simple and intuitive meta-language. The field model describes how features and architectures are connected to generate a set of posterior probabilities as output. The line out_dnn1=compute(MLP_layers1,mfcc) means "feed the architecture called MLP_layers1 with the features called mfcc and store the output into the variable out_dnn1 ". From the neural network output out_dnn1 , the error and the loss functions are computed using the labels called lab_cd , which must be previously defined in the [dataset*] sections. The err_final and loss_final fields are mandatory subfields that define the final output of the model.
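To make the structure of this meta-language concrete, the sketch below splits each model line into an output name, an operation, and its arguments. It is not the toolkit's parser: the function parse_model_line and the regular expression are hypothetical and only cover the simple out=op(arg1,arg2) form shown above.

```python
import re

# Matches lines such as "out_dnn1=compute(MLP_layers1,mfcc)".
_LINE_PATTERN = re.compile(r"^\s*(\w+)\s*=\s*(\w+)\((.*)\)\s*$")


def parse_model_line(line):
    """Split one model line into (output, operation, [arguments])."""
    match = _LINE_PATTERN.match(line)
    if match is None:
        raise ValueError("cannot parse model line: %r" % line)
    output, operation, raw_args = match.groups()
    arguments = [arg.strip() for arg in raw_args.split(",")]
    return output, operation, arguments


model = """out_dnn1=compute(MLP_layers1,mfcc)
loss_final=cost_nll(out_dnn1,lab_cd)
err_final=cost_err(out_dnn1,lab_cd)"""

parsed = [parse_model_line(line) for line in model.splitlines()]
defined = {output for output, _, _ in parsed}
# loss_final and err_final are the mandatory outputs mentioned in the text.
missing = {"loss_final", "err_final"} - defined
```

Here missing is empty, so this toy model defines both mandatory outputs.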

A more complex example (discussed here only to highlight the potential of the toolkit) is reported in cfg/TIMIT_baselines/TIMIT_mfcc_fbank_fmllr_liGRU_best.cfg :

 [model]
model_proto=proto/model.proto
model:conc1=concatenate(mfcc,fbank)
      conc2=concatenate(conc1,fmllr)
      out_dnn1=compute(MLP_layers_first,conc2)
      out_dnn2=compute(liGRU_layers,out_dnn1)
      out_dnn3=compute(MLP_layers_second,out_dnn2)
      out_dnn4=compute(MLP_layers_last,out_dnn3)
      out_dnn5=compute(MLP_layers_last2,out_dnn3)
      loss_mono=cost_nll(out_dnn5,lab_mono)
      loss_mono_w=mult_constant(loss_mono,1.0)
      loss_cd=cost_nll(out_dnn4,lab_cd)
      loss_final=sum(loss_cd,loss_mono_w)     
      err_final=cost_err(out_dnn4,lab_cd)

In this case, we first concatenate the mfcc, fbank, and fmllr features and then feed an MLP. The output of the MLP is fed into a recurrent neural network (specifically, a Li-GRU model). We then have another MLP layer ( MLP_layers_second ) followed by two softmax classifiers (i.e., MLP_layers_last and MLP_layers_last2 ). The first one estimates standard context-dependent states, while the second one estimates monophone targets. The final cost function is a weighted sum of the two corresponding losses. In this way, we implement monophone regularization, which turned out to be useful to improve ASR performance.
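The way the two losses are combined can be illustrated with a small pure-Python sketch for a single frame. The scores and labels below are made up, and the helper functions log_softmax and nll are hypothetical stand-ins for what cost_nll computes on the real network outputs.

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    max_score = max(scores)
    log_norm = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return [s - log_norm for s in scores]


def nll(log_probs, target):
    """Negative log-likelihood of the target class for one frame."""
    return -log_probs[target]


# Toy scores for one frame (made-up numbers).
cd_scores = [2.0, 0.5, -1.0, 0.0]   # context-dependent classifier, 4 states
mono_scores = [1.0, 0.0]            # monophone classifier, 2 phones
lab_cd, lab_mono = 0, 0

loss_cd = nll(log_softmax(cd_scores), lab_cd)
loss_mono = nll(log_softmax(mono_scores), lab_mono)
loss_mono_w = 1.0 * loss_mono        # mult_constant(loss_mono,1.0)
loss_final = loss_cd + loss_mono_w   # sum(loss_cd,loss_mono_w)
```

Changing the constant in mult_constant changes how strongly the monophone loss contributes to loss_final.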

The full model can be considered as a single big computational graph, where all the basic architectures used in the [model] section are jointly trained. For each mini-batch, the input features are propagated through the full model and the final cost ( loss_final ) is computed using the specified labels. The gradient of the cost function with respect to all the learnable parameters of the architectures is then computed. All the parameters of the employed architectures are then updated with the algorithm specified in the [architecture*] sections.

 [forward]
forward_out = out_dnn1
normalize_posteriors = True
normalize_with_counts_from = lab_cd
save_out_file = True
require_decoding = True

The section forward first defines which output to forward (it must be defined in the model section). If normalize_posteriors=True , these posteriors are normalized by their priors (using a count file). If save_out_file=True , the posterior file (usually a very big ark file) is stored, while if save_out_file=False this file is deleted when it is not needed anymore. require_decoding is a boolean that specifies whether we need to decode the specified output. The field normalize_with_counts_from sets which counts are used to normalize the posterior probabilities.
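The normalization by the priors can be sketched as follows. This is an illustration, not the toolkit's code: it assumes that the priors are estimated as relative frequencies from the counts, that all counts are positive, and that the division is carried out as a subtraction in the log domain. The function names are hypothetical.

```python
import math


def log_priors_from_counts(counts):
    """Turn per-state occurrence counts into log prior probabilities."""
    total = float(sum(counts))
    return [math.log(c / total) for c in counts]


def posteriors_to_scaled_log_likelihoods(log_posteriors, log_priors):
    """Subtract log priors from log posteriors (a division in the log domain)."""
    return [lp - lpr for lp, lpr in zip(log_posteriors, log_priors)]


# Made-up counts for three states and one frame of DNN posteriors.
counts = [700, 200, 100]
posteriors = [0.70, 0.25, 0.05]

log_priors = log_priors_from_counts(counts)
log_posteriors = [math.log(p) for p in posteriors]
scaled_loglik = posteriors_to_scaled_log_likelihoods(log_posteriors, log_priors)
```

A state whose posterior equals its prior gets a scaled log-likelihood of about zero, while states that are more (or less) likely than their prior get positive (or negative) values.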

 [decoding]
decoding_script_folder = kaldi_decoding_scripts/
decoding_script = decode_dnn.sh
decoding_proto = proto/decoding.proto
min_active = 200
max_active = 7000
max_mem = 50000000
beam = 13.0
latbeam = 8.0
acwt = 0.2
max_arcs = -1
skip_scoring = false
scoring_script = local/score.sh
scoring_opts = "--min-lmwt 1 --max-lmwt 10"
norm_vars = False

The decoding section reports the parameters for decoding, i.e., the steps that allow one to pass from the sequence of context-dependent probabilities provided by the DNN to a sequence of words. The field decoding_script_folder specifies the folder where the decoding script is stored. The field decoding_script is the script used for decoding (e.g., decode_dnn.sh ), which should be in the decoding_script_folder specified before. The field decoding_proto reports all the parameters needed for the considered decoding script.

To make the code more flexible, the config parameters can also be specified within the command line. For example, you can run:

 python run_exp.py quick_test/example_newcode.cfg --optimization,lr=0.01 --batches,batch_size=4

The script will replace the learning rate in the specified cfg file with the specified lr value. The modified config file is then stored into out_folder/config.cfg .
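The idea behind these command-line overrides can be sketched with the standard configparser module. This is a simplified illustration and not the code used by run_exp.py: the function apply_override is hypothetical, and the toy config only contains the two fields touched by the command above.

```python
import configparser


def apply_override(config, argument):
    """Apply one '--section,field=value' argument to a ConfigParser."""
    target, value = argument.lstrip("-").split("=", 1)
    section, field = target.split(",", 1)
    if not config.has_section(section):
        raise KeyError("unknown section: %s" % section)
    config.set(section, field, value)


# Toy configuration (assumed values, for illustration only).
config = configparser.ConfigParser()
config.read_string("""
[optimization]
lr = 0.08

[batches]
batch_size = 128
""")

for argument in ["--optimization,lr=0.01", "--batches,batch_size=4"]:
    apply_override(config, argument)
```

After the loop, the in-memory configuration holds lr = 0.01 and batch_size = 4, which is the state that would then be written to the modified config file.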

The script run_exp.py automatically creates chunk-specific config files that are used by the run_nn function to perform a single chunk training. The structure of chunk-specific cfg files is very similar to that of the global one. The main difference is a field to_do={train, valid, forward} that specifies the type of processing to perform on the features chunk specified in the field fea .

Why proto files? Different neural networks, optimization algorithms, and HMM decoders might depend on a different set of hyperparameters. To address this issue, our current solution is based on the definition of some prototype files (for global, chunk, and architecture config files). In general, this approach allows a more transparent check of the fields specified in the global config file. Moreover, it allows users to easily add new parameters without changing any line of the Python code. For instance, to add a user-defined model, a new proto file (e.g., user-model.proto ) that specifies the hyperparameters must be written. Then, the user only has to write a class (e.g., user-model in neural_networks.py ) that implements the architecture.
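As a rough illustration of how a proto file enables such checks, the sketch below validates config strings against a few proto types ( bool , str_list , bool_list , float_list(low,high) ). The function check_value is hypothetical and much simpler than the toolkit's own parser; in particular, it does not handle the scheduling syntax with "*" and "|".

```python
import re


def check_value(proto_type, value):
    """Return True if the config string `value` is valid for `proto_type`."""
    base = proto_type.split("(")[0]            # drop optional "(low,high)" bounds
    is_list = base.endswith("_list")
    base = base[:-len("_list")] if is_list else base
    bounds = re.search(r"\((.+),(.+)\)", proto_type)
    items = value.split(",") if is_list else [value]
    for item in items:
        item = item.strip()
        if base == "bool":
            if item not in ("True", "False"):
                return False
        elif base == "float":
            try:
                number = float(item)
            except ValueError:
                return False
            if bounds and not float(bounds.group(1)) <= number <= float(bounds.group(2)):
                return False
        elif base == "str":
            if item == "":
                return False
        else:
            raise ValueError("unsupported proto type: %s" % proto_type)
    return True


# A few proto entries and matching config values (illustrative values).
proto = {"dnn_lay": "str_list", "dnn_drop": "float_list(0.0,1.0)",
         "dnn_use_batchnorm": "bool_list", "dnn_use_laynorm_inp": "bool"}
config = {"dnn_lay": "1024,1024,N_out_lab_cd", "dnn_drop": "0.15,0.15,0.0",
          "dnn_use_batchnorm": "True,True,False", "dnn_use_laynorm_inp": "False"}
errors = [name for name, proto_type in proto.items()
          if not check_value(proto_type, config[name])]
```

With this design, adding a new hyperparameter only requires a new line in the proto dictionary, not a change to the checking code.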

[FAQs]

How can I plug-in my model

The toolkit is designed to allow users to easily plug-in their own acoustic models. To add a customized neural model do the following steps:

Go into the proto folder and create a new proto file (eg, proto/myDNN.proto ). The proto file is used to specify the list of the hyperparameters of your model that will be later set into the configuration file. To have an idea about the information to add to your proto file, you can take a look into the MLP.proto file:

 [proto]
dnn_lay=str_list
dnn_drop=float_list(0.0,1.0)
dnn_use_laynorm_inp=bool
dnn_use_batchnorm_inp=bool
dnn_use_batchnorm=bool_list
dnn_use_laynorm=bool_list
dnn_act=str_list

The parameter dnn_lay must be a list of strings, dnn_drop (i.e., the dropout factors for each layer) is a list of floats ranging from 0.0 to 1.0, and dnn_use_laynorm_inp and dnn_use_batchnorm_inp are booleans that enable or disable layer or batch normalization of the input. dnn_use_batchnorm and dnn_use_laynorm are lists of booleans that decide, layer by layer, whether batch/layer normalization has to be used. The parameter dnn_act is again a list of strings that sets the activation function of each layer. Since every model is based on its own set of hyperparameters, different models have a different prototype file. For instance, you can take a look into GRU.proto and see that the hyperparameter list is different from that of a standard MLP. Similarly to the previous examples, you should add here your list of hyperparameters and save the file.
Write a PyTorch class implementing your model. Open the library neural_networks.py and look at some of the models already implemented. For simplicity, you can start by taking a look into the class MLP. The classes have two mandatory methods: __init__ and forward . The first one is used to initialize the architecture, the second specifies the list of computations to do. The method __init__ takes as input two variables that are automatically computed within the run_nn function. inp_dim is simply the dimensionality of the neural network input, while options is a dictionary containing all the parameters specified in the architecture section of the configuration file.
For instance, you can access the sizes of the various DNN layers in this way: options['dnn_lay'].split(',') . As you might see from the MLP class, the initialization method defines and initializes all the parameters of the neural network. The forward method takes as input a tensor x (i.e., the input data) and outputs another tensor computed from x. If your model is a sequence model (i.e., if there is at least one architecture with arch_seq_model=true in the cfg file), x is a tensor with shape (time_steps, batches, N_in), otherwise it is a (batches, N_in) matrix. The forward method defines the list of computations that transform the input tensor into the corresponding output tensor. The output must have the sequential format (time_steps, batches, N_out) for recurrent models and the non-sequential format (batches, N_out) for feed-forward models. Similarly to the already-implemented models, the user should write a new class (e.g., myDNN) that implements the customized model:

import torch.nn as nn

class myDNN(nn.Module):

    def __init__(self, options, inp_dim):
        super(myDNN, self).__init__()
        # initialize the parameters here

    def forward(self, x):
        # do some computations: out = f(x)
        out = x  # placeholder, replace with your own computations
        return out

Create a configuration file. Now that you have defined your model and the list of its hyperparameters, you can create a configuration file. To create your own configuration file, you can take a look into an already existing config file (eg, for simplicity you can consider cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg ). After defining the adopted datasets with their related features and labels, the configuration file has some sections called [architecture*] . Each architecture implements a different neural network. In cfg/TIMIT_baselines/TIMIT_MLP_mfcc_basic.cfg we only have [architecture1] since the acoustic model is composed of a single neural network. To add your own neural network, you have to write an architecture section (eg, [architecture1] ) in the following way:

 [architecture1]
arch_name= mynetwork (this is a name you would like to use to refer to this architecture within the following model section)
arch_proto=proto/myDNN.proto (here is the name of the proto file defined before)
arch_library=neural_networks (this is the name of the library where myDNN is implemented)
arch_class=myDNN (This must be the name of the  class you have implemented)
arch_pretrain_file=none (With this you can specify if you want to pre-train your model)
arch_freeze=False (set False if you want to update the parameters of your model)
arch_seq_model=False (set False for feed-forward models, True for recurrent models)

Then, you have to specify proper values for all the hyperparameters specified in proto/myDNN.proto . For the MLP.proto , we have:

 dnn_lay=1024,1024,1024,1024,1024,N_out_lab_cd
dnn_drop=0.15,0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp=False
dnn_use_batchnorm_inp=False
dnn_use_batchnorm=True,True,True,True,True,False
dnn_use_laynorm=False,False,False,False,False,False
dnn_act=relu,relu,relu,relu,relu,softmax

Then, add the following parameters related to the optimization of your own architecture. You can use here standard sgd, adam, or rmsprop (see cfg/TIMIT_baselines/TIMIT_LSTM_mfcc.cfg for an example with rmsprop):

 arch_lr=0.08
arch_halving_factor=0.5
arch_improvement_threshold=0.001
arch_opt=sgd
opt_momentum=0.0
opt_weight_decay=0.0
opt_dampening=0.0
opt_nesterov=False

Save the configuration file into the cfg folder (eg, cfg/myDNN_exp.cfg ).
Run the experiment with:

 python run_exp.py cfg/myDNN_exp.cfg

To debug the model you can first take a look at the standard output. The config file is automatically parsed by the run_exp.py and it raises errors in case of possible problems. You can also take a look into the log.log file to see additional information on the possible errors.

When implementing a new model, an important debug test consists of doing an overfitting experiment (to make sure that the model is able to overfit a tiny dataset). If the model is not able to overfit, it means that there is a major bug to solve.

Hyperparameter tuning. In deep learning, it is often important to play with the hyperparameters to find the proper setting for your model. This activity is usually computationally expensive and time-consuming, but it is often necessary when introducing new architectures. To help hyperparameter tuning, we developed a utility that implements a random search of the hyperparameters (see next section for more details).

How can I tune the hyperparameters

Hyperparameter tuning is often needed in deep learning to search for proper neural architectures. To help tune the hyperparameters within PyTorch-Kaldi, we have implemented a simple utility that performs a random search. In particular, the script tune_hyperparameters.py generates a set of random configuration files and can be run in this way:

 python tune_hyperparameters.py cfg/TIMIT_MLP_mfcc.cfg exp/TIMIT_MLP_mfcc_tuning 10 arch_lr=randfloat(0.001,0.01) batch_size_train=randint(32,256) dnn_act=choose_str{relu,relu,relu,relu,softmax|tanh,tanh,tanh,tanh,softmax}

The first parameter is the reference cfg file that we would like to modify, while the second one is the folder where the random configuration files are saved. The third parameter is the number of random config files that we would like to generate. Then comes the list of all the hyperparameters that we want to change. For instance, arch_lr=randfloat(0.001,0.01) will replace the field arch_lr with a random float ranging from 0.001 to 0.01, batch_size_train=randint(32,256) will replace batch_size_train with a random integer between 32 and 256, and so on. Once the config files are created, they can be run sequentially or in parallel with:

 python run_exp.py $cfg_file
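The random-search syntax described above ( randfloat , randint , choose_str ) can be mimicked with a short sketch. It is not the code of tune_hyperparameters.py: the functions sample_value and sample_config are hypothetical, and treating both bounds as inclusive is an assumption.

```python
import random
import re


def sample_value(spec, rng):
    """Draw one value for a spec such as 'randfloat(0.001,0.01)'."""
    match = re.fullmatch(r"(randfloat|randint)\((.+),(.+)\)", spec)
    if match:
        kind, low, high = match.groups()
        if kind == "randint":
            return str(rng.randint(int(low), int(high)))
        return str(rng.uniform(float(low), float(high)))
    match = re.fullmatch(r"choose_str\{(.+)\}", spec)
    if match:
        # Alternatives are separated by "|".
        return rng.choice(match.group(1).split("|"))
    raise ValueError("unknown spec: %s" % spec)


def sample_config(arguments, rng):
    """Map each 'field=spec' argument to a sampled field/value pair."""
    sampled = {}
    for argument in arguments:
        field, spec = argument.split("=", 1)
        sampled[field] = sample_value(spec, rng)
    return sampled


rng = random.Random(0)  # fixed seed so the sketch is repeatable
arguments = [
    "arch_lr=randfloat(0.001,0.01)",
    "batch_size_train=randint(32,256)",
    "dnn_act=choose_str{relu,relu,relu,relu,softmax|tanh,tanh,tanh,tanh,softmax}",
]
trials = [sample_config(arguments, rng) for _ in range(10)]
```

Each element of trials plays the role of one randomly generated configuration, i.e., the set of field values that would be written into one of the generated cfg files.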

How can I use my own dataset

PyTorch-Kaldi can be used with any speech dataset. To use your own dataset, the steps to take are similar to those discussed in the TIMIT/Librispeech tutorials. In general, what you have to do is the following:

Run the Kaldi recipe with your dataset. Please, see the Kaldi website to have more information on how to perform data preparation.
Compute the alignments on training, validation, and test data.
Write a PyTorch-Kaldi config file $cfg_file .
Run the config file with python run_exp.py $cfg_file .

How can I plug-in my own features

The current version of PyTorch-Kaldi supports input features stored with the Kaldi ark format. If the user wants to perform experiments with customized features, the latter must be converted into the ark format. Take a look into the Kaldi-io-for-python git repository (https://github.com/vesis84/kaldi-io-for-python) for a detailed description about converting numpy arrays into ark files. Moreover, you can take a look into our utility called save_raw_fea.py. This script generates Kaldi ark files containing raw features, that are later used to train neural networks fed by the raw waveform directly (see the section about processing audio with SincNet).

How can I transcribe my own audio files

The current version of PyTorch-Kaldi supports the standard production process of using a pre-trained PyTorch-Kaldi acoustic model to transcribe one or multiple .wav files. It is important to understand that you must have a trained PyTorch-Kaldi model. While you don't need labels or alignments anymore, PyTorch-Kaldi still needs several files to transcribe a new audio file:

The features and features list feats.scp (with .ark files, see the section "How can I plug-in my own features")
The decoding graph (usually created with mkgraph.sh during previous model training such as triphones models). This graph is not needed if you're not decoding.

Once you have all these files, you can start adding your dataset section to the global configuration file. The easiest way is to copy the cfg file used to train your acoustic model and just modify by adding a new [dataset] :

 [dataset4]
data_name = myWavFile
fea = fea_name=fbank
  fea_lst=myWavFilePath/data/feats.scp
  fea_opts=apply-cmvn --utt2spk=ark:myWavFilePath/data//utt2spk  ark:myWavFilePath/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=0 ark:- ark:- |
  cw_left=5
  cw_right=5

lab = lab_name=none
  lab_data_folder=myWavFilePath/data/
  lab_graph=myWavFilePath/exp/tri3/graph
n_chunks=1

[data_use]
train_with = TIMIT_tr
valid_with = TIMIT_dev
forward_with = myWavFile

The key string for your audio file transcription is lab_name=none . The none tag asks PyTorch-Kaldi to enter a production mode that only does the forward propagation and decoding without any labels. You don't need TIMIT_tr and TIMIT_dev to be on your production server since PyTorch-Kaldi will skip this information and go directly to the forward phase of the dataset given in the forward_with field. As you can see, the global fea field requires exactly the same parameters as a standard training or testing dataset, while the lab field only requires two parameters. Please, note that lab_data_folder is nothing more than the same path as fea_lst . Finally, you still need to specify the number of chunks you want to create to process this file (1 hour = 1 chunk).
Warnings
In your standard .cfg, you might have used keywords such as N_out_lab_cd that can not be used anymore. Indeed, in a production scenario, you don't want to have the training data on your machine. Therefore, all the variables that were on your .cfg file must be replaced by their true values. To replace all the N_out_{mono,lab_cd} you can take a look at the output of:

 hmm-info /path/to/the/final.mdl/used/to/generate/the/training/ali

Then, if you normalize posteriors as (check in your .cfg Section forward):

 normalize_posteriors = True
normalize_with_counts_from = lab_cd

You must replace lab_cd by:

 normalize_posteriors = True
normalize_with_counts_from = /path/to/ali_train_pdf.counts

This normalization step is crucial for HMM-DNN speech recognition. DNNs, in fact, provide posterior probabilities, while HMMs are generative models that work with likelihoods. To derive the required likelihoods, one can simply divide the posteriors by the prior probabilities. To create this ali_train_pdf.counts file you can follow:

 alidir=/path/to/the/exp/tri_ali  # change it with your path to the exp with the ali
num_pdf=$(hmm-info $alidir/final.mdl | awk '/pdfs/{print $4}')
labels_tr_pdf="ark:ali-to-pdf $alidir/final.mdl \"ark:gunzip -c $alidir/ali.*.gz |\" ark:- |"
analyze-counts --verbose=1 --binary=false --counts-dim=$num_pdf "$labels_tr_pdf" ali_train_pdf.counts

Et voilà! In a production scenario, you might need to transcribe a huge number of audio files, and you don't want to create a separate .cfg file for each of them. To this end, after creating this initial production .cfg file (you can leave the paths blank), you can call the run_exp.py script with specific arguments referring to your different .wav features:

 python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_fbank_prod.cfg --dataset4,fea,0,fea_lst="myWavFilePath/data/feats.scp" --dataset4,lab,0,lab_data_folder="myWavFilePath/data/" --dataset4,lab,0,lab_graph="myWavFilePath/exp/tri3/graph/"

This command will internally alter the configuration file with your specified paths, and run on your defined features! Note that passing long arguments to the run_exp.py script requires a specific notation. --dataset4 specifies the name of the created section, fea is the name of the higher-level field, and fea_lst or lab_graph are the names of the lowest-level fields you want to change. The 0 indicates which lowest-level field you want to alter: some configuration files may contain multiple lab_graph per dataset! Therefore, 0 indicates the first occurrence, 1 the second, and so on. Paths MUST be enclosed in " " to be interpreted as full strings! Note that you need to alter the data_name and forward_with fields if you don't want the transcriptions of different .wav files to overwrite each other (decoding files are stored according to the field data_name ): --dataset4,data_name=MyNewName --data_use,forward_with=MyNewName .
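The structure of these long arguments can be made explicit with a small parsing sketch. The function parse_nested_override is hypothetical and only shows how the section, field, occurrence index, and lowest-level field could be separated; it is not how run_exp.py is implemented.

```python
def parse_nested_override(argument):
    """Split '--section,field,index,subfield=value' (or '--section,field=value')."""
    target, value = argument.lstrip("-").split("=", 1)
    value = value.strip('"')  # drop the enclosing quotes, if still present
    parts = target.split(",")
    if len(parts) == 2:
        # Flat form, e.g. --dataset4,data_name=MyNewName
        section, field = parts
        return {"section": section, "field": field, "index": None,
                "subfield": None, "value": value}
    if len(parts) == 4:
        # Nested form, e.g. --dataset4,fea,0,fea_lst="..."
        section, field, index, subfield = parts
        return {"section": section, "field": field, "index": int(index),
                "subfield": subfield, "value": value}
    raise ValueError("unexpected override format: %s" % argument)


nested = parse_nested_override('--dataset4,fea,0,fea_lst="myWavFilePath/data/feats.scp"')
flat = parse_nested_override("--dataset4,data_name=MyNewName")
```

In the nested case, index 0 selects the first occurrence of fea_lst inside the fea field of the [dataset4] section.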

Batch size, learning rate, and dropout scheduler

In order to give users more flexibility, the latest version of PyTorch-Kaldi supports scheduling of the batch size, max_seq_length_train, learning rate, and dropout factor. This means that it is now possible to change these values during training. To support this feature, we implemented the following formalisms within the config files:

 batch_size_train = 128*12 | 64*10 | 32*2

In this case, our batch size will be 128 for the first 12 epochs, 64 for the following 10 epochs, and 32 for the last two epochs. By default "*" means "for N times", while "|" is used to indicate a change of the batch size. Note that if the user simply sets batch_size_train = 128 , the batch size is kept fixed during all the training epochs by default.
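The schedule notation can be expanded into one value per epoch with a few lines of Python. This is only a sketch of the formalism: the function expand_schedule and its arguments are hypothetical.

```python
def expand_schedule(spec, n_epochs=None, cast=float):
    """Expand 'v1*n1 | v2*n2' into a list with one value per epoch."""
    values = []
    for segment in spec.split("|"):
        segment = segment.strip()
        if "*" in segment:
            value, repeats = segment.split("*")
            values.extend([cast(value)] * int(repeats))
        else:
            # A bare value is kept fixed for all epochs (or once if unknown).
            values.extend([cast(segment)] * (n_epochs or 1))
    return values


batch_sizes = expand_schedule("128*12 | 64*10 | 32*2", cast=int)
fixed = expand_schedule("128", n_epochs=24, cast=int)
```

For the example above, batch_sizes contains 24 entries: 128 for epochs 0 to 11, 64 for epochs 12 to 21, and 32 for the last two epochs.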

A similar formalism can be used to perform learning rate scheduling:

 arch_lr = 0.08*10|0.04*5|0.02*3|0.01*2|0.005*2|0.0025*2

If the user instead simply sets arch_lr = 0.08 , the learning rate is annealed with the new-bob procedure used in the previous version of the toolkit. In practice, we start from the specified learning rate and multiply it by a halving factor every time the improvement on the validation dataset is smaller than the threshold specified in the field arch_improvement_threshold .

Also the dropout factor can now be changed during training with the following formalism:

 dnn_drop = 0.15*12|0.20*12,0.15,0.15*10|0.20*14,0.15,0.0

With the line before we can set a different dropout rate for different layers and for different epochs. For instance, the first hidden layer will have a dropout rate of 0.15 for the first 12 epochs, and 0.20 for the other 12. The dropout factor of the second layer, instead, will remain constant to 0.15 over all the training. The same formalism is used for all the layers. Note that "|" indicates a change in the dropout factor within the same layer, while "," indicates a different layer.
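The per-layer dropout notation can be read as follows: split on "," to obtain the layers, then on "|" to obtain the segments of each layer. The sketch below returns the dropout rate of every layer at a given epoch; the function dropout_at_epoch is hypothetical, and keeping the last value once a schedule is exhausted is an assumption.

```python
def dropout_at_epoch(spec, epoch):
    """Return the dropout rate of every layer at a 0-based epoch index."""
    rates = []
    for layer_spec in spec.split(","):
        rate = None
        remaining = epoch
        for segment in layer_spec.split("|"):
            if "*" not in segment:
                rate = float(segment)   # constant over all epochs
                break
            value, repeats = segment.split("*")
            rate = float(value)
            if remaining < int(repeats):
                break                   # the epoch falls inside this segment
            remaining -= int(repeats)
        rates.append(rate)
    return rates


spec = "0.15*12|0.20*12,0.15,0.15*10|0.20*14,0.15,0.0"
first_epoch = dropout_at_epoch(spec, 0)
epoch_10 = dropout_at_epoch(spec, 10)
epoch_12 = dropout_at_epoch(spec, 12)
```

At epoch index 10 only the third layer has already switched to 0.20, while at epoch index 12 the first layer has switched as well.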

You can take a look at a config file where batch sizes, learning rates, and dropout factors are changed here:

 cfg/TIMIT_baselines/TIMIT_mfcc_basic_flex.cfg

or here:

 cfg/TIMIT_baselines/TIMIT_liGRU_fmllr_lr_schedule.cfg

How can I contribute to the project

The project is still in its initial phase and we invite all potential contributors to participate. We hope to build a community of developers large enough to progressively maintain, improve, and expand the functionalities of our current toolkit. For instance, it could be helpful to report any bug or any suggestion to improve the current version of the code. People can also contribute by adding new neural models, which can eventually enrich the set of currently implemented architectures.

[EXTRA]

Speech recognition from the raw waveform with SincNet

Take a look at our video introduction to SincNet

SincNet is a convolutional neural network recently proposed to process raw audio waveforms. In particular, SincNet encourages the first layer to discover more meaningful filters by exploiting parametrized sinc functions. In contrast to standard CNNs, which learn all the elements of each filter, only low and high cutoff frequencies of band-pass filters are directly learned from data. This inductive bias offers a very compact way to derive a customized filter-bank front-end, that only depends on some parameters with a clear physical meaning.
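To illustrate the idea, the sketch below builds a band-pass filter from only two parameters, the low and high cutoff frequencies, as the difference between two sinc-based low-pass filters: g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n), with sinc(x) = sin(x)/x. The cutoff values, filter length, and function names are assumptions chosen for illustration; in SincNet the cutoffs are learned, and the windowing of the filter is omitted here.

```python
import math


def sinc(x):
    """Unnormalized sinc: sin(x)/x, with sinc(0) = 1."""
    return 1.0 if x == 0 else math.sin(x) / x


def sinc_band_pass(f1, f2, length):
    """Band-pass filter taps from two cutoffs given as fractions of the sampling rate.

    Computes g[n] = 2*f2*sinc(2*pi*f2*n) - 2*f1*sinc(2*pi*f1*n)
    for n centered around zero (length is expected to be odd).
    """
    half = (length - 1) // 2
    return [
        2 * f2 * sinc(2 * math.pi * f2 * n) - 2 * f1 * sinc(2 * math.pi * f1 * n)
        for n in range(-half, half + 1)
    ]


# Hypothetical cutoffs: 300 Hz and 3400 Hz at a 16 kHz sampling rate.
fs = 16000.0
taps = sinc_band_pass(300 / fs, 3400 / fs, length=251)
```

The whole 251-tap filter is determined by two numbers, which is what makes the resulting filter bank compact and easy to interpret.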

For a more detailed description of the SincNet model, please refer to the following papers:

M. Ravanelli, Y. Bengio, "Speaker Recognition from raw waveform with SincNet", in Proc. of SLT 2018 ArXiv
M. Ravanelli, Y.Bengio, "Interpretable Convolutional Filters with SincNet", in Proc. of NIPS@IRASL 2018 ArXiv

To use this model for speech recognition on TIMIT, do the following steps:

Follow the steps described in the "TIMIT tutorial".
Save the raw waveform into the Kaldi ark format. To do it, you can use the save_raw_fea.py utility in our repository. The script saves the input signals into a binary Kaldi archive, keeping the alignments with the pre-computed labels. You have to run it for all the data chunks (eg, train, dev, test). It can also specify the length of the speech chunk ( sig_wlen=200 # ms ) composing each frame.
Open the cfg/TIMIT_baselines/TIMIT_SincNet_raw.cfg , change your paths, and run:

 python ./run_exp.py cfg/TIMIT_baselines/TIMIT_SincNet_raw.cfg

With this architecture, we have obtained a PER(%)=17.1% . A standard CNN fed with the same features gives us a PER(%)=18.% . Please, see here to take a look into our results. Our results with SincNet outperform those obtained with MFCCs and FBANKs fed into standard feed-forward networks.

In the following table, we compare the results of SincNet with those of other feed-forward neural networks:

Model	PER(%)
MLP -fbank	18.7
MLP -mfcc	18.2
CNN -raw	18.1
SincNet -raw	17.2

Joint training between speech enhancement and ASR

In this section, we show how to use PyTorch-Kaldi to jointly train a cascade of a speech enhancement neural network and a speech recognition neural network. The speech enhancement network has the goal of improving the quality of the speech signal by minimizing the MSE between clean and noisy features. The enhanced features then feed another neural network that predicts context-dependent phone states.

In the following, we report a toy-task example based on a reverberated version of TIMIT, that is only intended to show how users should set the config file to train such a combination of neural networks. Even though some implementation details (and the adopted datasets) are different, this tutorial is inspired by this paper:

M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Batch-normalized joint training for DNN-based distant speech recognition", in Proceedings of SLT 2016 arXiv

To run the system do the following steps:

1- Make sure you have the standard clean version of TIMIT available.

2- Run the Kaldi s5 baseline of TIMIT. This step is necessary to compute the clean features (that will be the labels of the speech enhancement system) and the alignments (that will be the labels of the speech recognition system). We recommend running the full timit s5 recipe (including the DNN training).

3- The standard TIMIT recipe uses MFCC features. In this tutorial, instead, we use FBANK features. To compute FBANK features run the following script in $KALDI_ROOT/egs/TIMIT/s5 :

 feadir=fbank

for x in train dev test; do
  steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done

Note that we use 40 FBANKS here, while Kaldi uses by default 23 FBANKs. To compute 40-dimensional features go into "$KALDI_ROOT/egs/TIMIT/conf/fbank.conf" and change the number of considered output filters.

4- Go to this external repository and follow the steps to generate a reverberated version of TIMIT starting from the clean one. Note that this is just a toy task that is only helpful to show how to set up a joint-training system.

5- Compute the FBANK features for the TIMIT_rev dataset. To do it, you can copy the scripts in $KALDI_ROOT/egs/TIMIT/ into $KALDI_ROOT/egs/TIMIT_rev/ . Please, copy also the data folder. Note that the audio files in the TIMIT_rev folders are saved with the standard WAV format, while TIMIT is released with the SPHERE format. To bypass this issue, open the files data/train/wav.scp , data/dev/wav.scp , data/test/wav.scp and delete the part about SPHERE reading (e.g., /home/mirco/kaldi-trunk/tools/sph2pipe_v2.5/sph2pipe -f wav ). You also have to change the paths from the standard TIMIT to the reverberated one (e.g., replace /TIMIT/ with /TIMIT_rev/). Remember to remove the final pipe symbol "|". Save the changes and run the computation of the fbank features in this way:

 feadir=fbank

for x in train dev test; do
  steps/make_fbank.sh --cmd "$train_cmd" --nj $feats_nj data/$x exp/make_fbank/$x $feadir
  steps/compute_cmvn_stats.sh data/$x exp/make_fbank/$x $feadir
done

Remember to change the $KALDI_ROOT/egs/TIMIT_rev/conf/fbank.conf file in order to compute 40 features rather than the 23 FBANKS of the default configuration.

6- Once features are computed, open the following config file:

 cfg/TIMIT_baselines/TIMIT_rev/TIMIT_joint_training_liGRU_fbank.cfg

Remember to change the paths according to where data are stored in your machine. As you can see, we consider two types of features. The fbank_rev features are computed from the TIMIT_rev dataset, while the fbank_clean features are derived from the standard TIMIT dataset and are used as targets for the speech enhancement neural network. As you can see in the [model] section of the config file, we have the cascade between networks doing speech enhancement and speech recognition. The speech recognition architecture jointly estimates both context-dependent and monophone targets (thus using the so-called monophone regularization). To run an experiment type the following command:

 python run_exp.py  cfg/TIMIT_baselines/TIMIT_rev/TIMIT_joint_training_liGRU_fbank.cfg

7- Results: With this configuration file, you should obtain a Phone Error Rate (PER)=28.1% . Note that some oscillations around this performance are natural and are due to different initializations of the neural parameters.

You can take a closer look into our results here

Distant Speech Recognition with DIRHA

In this tutorial, we use the DIRHA-English dataset to perform a distant speech recognition experiment. The DIRHA-English dataset is a multi-microphone speech corpus being developed under the EC project DIRHA. The corpus is composed of both real and simulated sequences recorded with 32 sample-synchronized microphones in a domestic environment. The database contains signals of different characteristics in terms of noise and reverberation, making it suitable for various multi-microphone signal processing and distant speech recognition tasks. The part of the dataset currently released is composed of 6 native US speakers (3 males, 3 females) uttering 409 Wall Street Journal sentences. The training data have been created using a realistic data contamination approach, which is based on contaminating the clean speech wsj-5k sentences with high-quality multi-microphone impulse responses measured in the targeted environment. For more details on this dataset, please refer to the following papers:

M. Ravanelli, L. Cristoforetti, R. Gretter, M. Pellin, A. Sosi, M. Omologo, "The DIRHA-English corpus and related tasks for distant-speech recognition in domestic environments", in Proceedings of ASRU 2015. ArXiv
M. Ravanelli, P. Svaizer, M. Omologo, "Realistic Multi-Microphone Data Simulation for Distant Speech Recognition", in Proceedings of Interspeech 2016. ArXiv

In this tutorial, we use the aforementioned simulated data for training (using LA6 microphone), while test is performed using the real recordings (LA6). This task is very realistic, but also very challenging. The speech signals are characterized by a reverberation time of about 0.7 seconds. Non-stationary domestic noises (such as vacuum cleaner, steps, phone rings, etc.) are also present in the real recordings.

Let's start now with the practical tutorial.

1- If not available, download the DIRHA dataset from the LDC website. LDC releases the full dataset for a small fee.

2- Go to this external repository. As reported in this repository, you have to generate the contaminated WSJ dataset with the provided MATLAB script. Then, you can run the proposed Kaldi baseline to have features and labels ready for our PyTorch-Kaldi toolkit.

3- Open the following configuration file:

 cfg/DIRHA_baselines/DIRHA_liGRU_fmllr.cfg

The latter configuration file implements a simple RNN model based on a Light Gated Recurrent Unit (Li-GRU). We used fMLLR as input features. Change the paths and run the following command:

 python run_exp.py cfg/DIRHA_baselines/DIRHA_liGRU_fmllr.cfg

4- Results: The aforementioned system should provide Word Error Rate (WER%)=23.2% . You can find the results obtained by us here.

Using the other configuration files in the cfg/DIRHA_baselines folder you can perform experiments with different setups. With the provided configuration files you can obtain the following results:

Model	WER(%)
MLP	26.1
GRU	25.3
Li-GRU	23.8

Training an autoencoder

The current version of the repository is mainly designed for speech recognition experiments. We are actively working on a new version, which is much more flexible and can manage inputs/outputs different from Kaldi features/labels. Even with the current version, however, it is possible to implement other systems, such as an autoencoder.

An autoencoder is a neural network trained to reproduce its input at its output. The middle layer normally contains a bottleneck that forces the learned representation to compress the information in the input. In this tutorial, we provide a toy example based on the TIMIT dataset. See the following configuration file:

 cfg/TIMIT_baselines/TIMIT_MLP_fbank_autoencoder.cfg

Our inputs are the standard 40-dimensional fbank coefficients gathered using a context window of 11 frames (i.e., the total dimensionality of our input is 440). A feed-forward neural network (called MLP_encoder) encodes our features into a 100-dimensional representation. The decoder (called MLP_decoder) is fed the learned representations and tries to reconstruct the input. The system is trained with the Mean Squared Error (MSE) loss. Note that in the [Model] section we added the line "err_final=cost_err(dec_out,lab_cd)" at the end. The current version, in fact, requires by default that at least one label is specified (we will remove this limit in the next version).
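The architecture described above can be sketched in a few lines of plain PyTorch. This is only an illustration under stated assumptions, not the toolkit's implementation: the single-layer encoder and decoder, the ReLU activation, the random input batch, and the number of steps are all our choices; only the 440 and 100 dimensions, the MSE loss, and the names enc_out/dec_out come from the text. The learning rate mirrors the initial value shown in the training log.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

FBANK_DIM, CONTEXT = 40, 11
INPUT_DIM = FBANK_DIM * CONTEXT   # 40 coefficients x 11 frames = 440
BOTTLENECK_DIM = 100

# Hypothetical stand-ins for MLP_encoder and MLP_decoder (one layer each).
encoder = nn.Sequential(nn.Linear(INPUT_DIM, BOTTLENECK_DIM), nn.ReLU())
decoder = nn.Linear(BOTTLENECK_DIM, INPUT_DIM)

params = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.SGD(params, lr=0.08)
mse = nn.MSELoss()

# Random data standing in for a batch of spliced fbank features.
batch = torch.randn(32, INPUT_DIM)

losses = []
for step in range(20):
    optimizer.zero_grad()
    enc_out = encoder(batch)      # 100-dimensional bottleneck representation
    dec_out = decoder(enc_out)    # reconstruction of the 440-dimensional input
    loss = mse(dec_out, batch)    # the target is the input itself
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(tuple(enc_out.shape), tuple(dec_out.shape))
print(f"first loss={losses[0]:.4f} last loss={losses[-1]:.4f}")
```

The key point is that the target passed to the loss is the input batch itself, which is why no phonetic label is needed for training.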

You can train the system by running the following command:

 python run_exp.py cfg/TIMIT_baselines/TIMIT_MLP_fbank_autoencoder.cfg

The results should look like this:

 ep=000 tr=['TIMIT_tr'] loss=0.139 err=0.999 valid=TIMIT_dev loss=0.076 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=41
ep=001 tr=['TIMIT_tr'] loss=0.098 err=0.999 valid=TIMIT_dev loss=0.062 err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=39
ep=002 tr=['TIMIT_tr'] loss=0.091 err=0.999 valid=TIMIT_dev loss=0.058 err=1.000 lr_architecture1=0.040000 lr_architecture2=0.040000 time(s)=39
ep=003 tr=['TIMIT_tr'] loss=0.088 err=0.999 valid=TIMIT_dev loss=0.056 err=1.000 lr_architecture1=0.020000 lr_architecture2=0.020000 time(s)=38
ep=004 tr=['TIMIT_tr'] loss=0.087 err=0.999 valid=TIMIT_dev loss=0.055 err=0.999 lr_architecture1=0.010000 lr_architecture2=0.010000 time(s)=39
ep=005 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.005000 lr_architecture2=0.005000 time(s)=39
ep=006 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.002500 lr_architecture2=0.002500 time(s)=39
ep=007 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=1.000 lr_architecture1=0.001250 lr_architecture2=0.001250 time(s)=39
ep=008 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000625 lr_architecture2=0.000625 time(s)=41
ep=009 tr=['TIMIT_tr'] loss=0.086 err=0.999 valid=TIMIT_dev loss=0.054 err=0.999 lr_architecture1=0.000313 lr_architecture2=0.000313 time(s)=38
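To track the losses over epochs, the log lines above can be parsed with a short script. The sketch below assumes the layout shown in the sample output, where the first "loss=" field on a line is the training loss and the second is the validation loss; `parse_res_line` is a hypothetical helper, not a toolkit function.

```python
import re


def parse_res_line(line):
    """Return (epoch, training loss, validation loss) from one log line."""
    epoch = int(re.search(r"ep=(\d+)", line).group(1))
    # Assumption: training loss appears first, validation loss second.
    losses = [float(value) for value in re.findall(r"loss=([0-9.]+)", line)]
    return epoch, losses[0], losses[1]


sample = (
    "ep=000 tr=['TIMIT_tr'] loss=0.139 err=0.999 valid=TIMIT_dev loss=0.076 "
    "err=1.000 lr_architecture1=0.080000 lr_architecture2=0.080000 time(s)=41"
)

epoch, train_loss, valid_loss = parse_res_line(sample)
print(epoch, train_loss, valid_loss)
```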

You should only consider the "loss=" field. The "err=" field contains no useful information in this case (for the aforementioned reason). You can take a look at the generated features by typing the following command:

 copy-feats ark:exp/TIMIT_MLP_fbank_autoencoder/exp_files/forward_TIMIT_test_ep009_ck00_enc_out.ark  ark,t:- | more
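The text output of the command above can also be loaded into Python for further analysis. The following sketch assumes the usual Kaldi text-archive layout for matrices (an utterance ID followed by "[", one row of numbers per line, and "]" closing the last row); `parse_text_ark` is a hypothetical helper, and the sample values are made up for illustration. Empty matrices written on a single line are not handled.

```python
def parse_text_ark(lines):
    """Yield (utterance_id, rows) pairs from Kaldi text-archive lines."""
    utt_id, rows = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if utt_id is None:
            # Header line of the form "<utt_id>  [".
            utt_id, rows = line.split()[0], []
            continue
        closing = line.endswith("]")
        values = line.rstrip("]").split()
        if values:
            rows.append([float(v) for v in values])
        if closing:
            yield utt_id, rows
            utt_id = None


# Made-up sample in the assumed text format (real rows would have 100 values).
sample = """utt1  [
  0.1 0.2 0.3
  0.4 0.5 0.6 ]
utt2  [
  1.0 2.0 3.0 ]
"""

for utt, matrix in parse_text_ark(sample.splitlines()):
    print(utt, len(matrix), "frames x", len(matrix[0]), "dims")
```

In practice, the same function could be fed the lines read from the pipe (for example via sys.stdin) instead of the sample string.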

References

[1] M. Ravanelli, T. Parcollet, Y. Bengio, "The PyTorch-Kaldi Speech Recognition Toolkit", ArXiv

[2] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Improving speech recognition by revising gated recurrent units", in Proceedings of Interspeech 2017. ArXiv

[3] M. Ravanelli, P. Brakel, M. Omologo, Y. Bengio, "Light Gated Recurrent Units for Speech Recognition", in IEEE Transactions on Emerging Topics in Computational Intelligence. ArXiv

[4] M. Ravanelli, "Deep Learning for Distant Speech Recognition", PhD Thesis, Unitn 2017. ArXiv

[5] T. Parcollet, M. Ravanelli, M. Morchid, G. Linarès, C. Trabelsi, R. De Mori, Y. Bengio, "Quaternion Recurrent Neural Networks", in Proceedings of ICLR 2019. ArXiv

[6] T. Parcollet, M. Morchid, G. Linarès, R. De Mori, "Bidirectional Quaternion Long-Short Term Memory Recurrent Neural Networks for Speech Recognition", in Proceedings of ICASSP 2019. ArXiv
