🇺🇦 Speech Recognition & Synthesis for Ukrainian
Overview
This repository collects links to models, datasets, and tools for Ukrainian Speech-to-Text and Text-to-Speech projects.
Community
- Discord: https://bit.ly/discord-uds
- Speech Recognition: https://t.me/speech_recognition_uk
- Speech Synthesis: https://t.me/speech_synthesis_uk
Speech-to-Text
Implementations
wav2vec2-bert
- 600M params: https://huggingface.co/Yehor/w2v-bert-2.0-uk-v2 (demo: https://huggingface.co/spaces/Yehor/w2v-bert-2.0-uk-v2-demo)
wav2vec2
- 1B params (with language model based on small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-1b-uk-with-lm
- 1B params (with language model based on News texts): https://huggingface.co/Yehor/wav2vec2-xls-r-1b-uk-with-news-lm
- 1B params (with binary language model based on News texts): https://huggingface.co/Yehor/wav2vec2-xls-r-1b-uk-with-binary-news-lm
- 1B params (with language model: OSCAR): https://huggingface.co/arampacha/wav2vec2-xls-r-1b-uk
- 1B params (with language model: OSCAR): https://huggingface.co/arampacha/wav2vec2-xls-r-1b-uk-cv
- 300M params (with language model based on small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-lm
- 300M params (but without language model): https://huggingface.co/robinhad/wav2vec2-xls-r-300m-uk
- 300M params (with language model based on small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-small-lm
- 300M params (with language model based on small portion of data) and noised data: https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-small-lm-noisy
- 300M params (with language model based on News texts): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-news-lm
- 300M params (with language model based on Wikipedia texts): https://huggingface.co/Yehor/wav2vec2-xls-r-300m-uk-with-wiki-lm
- 90M params (with language model based on small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-base-uk-with-small-lm
- 90M params (with language model based on small portion of data): https://huggingface.co/Yehor/wav2vec2-xls-r-base-uk-with-cv-lm
- ONNX model (1B and 300M models): https://github.com/egorsmkv/ukrainian-onnx-model
You can check out demos here: https://github.com/egorsmkv/wav2vec2-uk-demo
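The checkpoints above are standard Hugging Face models. As a minimal sketch (assuming `transformers`, `torch`, and ffmpeg are installed; the model id is taken from the list above and `audio.wav` is a placeholder file; checkpoints that bundle a language model additionally need `pyctcdecode` and `kenlm`):

```python
# Minimal sketch: transcribe a recording with one of the wav2vec2 checkpoints listed above.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="robinhad/wav2vec2-xls-r-300m-uk",  # any checkpoint from the list above
)

# The pipeline reads and resamples "audio.wav" (a placeholder path) via ffmpeg.
print(asr("audio.wav")["text"])
```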
data2vec
- data2vec-large: https://huggingface.co/robinhad/data2vec-large-uk
Citrinet
- NVIDIA Streaming Citrinet 1024 (uk): https://huggingface.co/nvidia/stt_uk_citrinet_1024_gamma_0_25
- NVIDIA Streaming Citrinet 512 (uk): https://huggingface.co/neongeckocom/stt_uk_citrinet_512_gamma_0_25
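The Citrinet checkpoints are NeMo models. A minimal sketch of running one, assuming `nemo_toolkit[asr]` is installed and `audio.wav` is a placeholder 16 kHz mono recording:

```python
# Minimal sketch: transcribe a file with the NVIDIA Citrinet checkpoint listed above.
import nemo.collections.asr as nemo_asr

# Downloads the checkpoint on first use.
model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/stt_uk_citrinet_1024_gamma_0_25"
)

# Depending on the NeMo version, transcribe() returns plain strings or hypothesis objects.
print(model.transcribe(["audio.wav"])[0])
```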
ContextNet
FastConformer
Squeezeformer
- Squeezeformer-CTC ML: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_ml
- Demo 1: https://huggingface.co/spaces/theodotus/streaming-asr-uk
- Demo 2: https://huggingface.co/spaces/theodotus/buffered-asr-uk
- Squeezeformer-CTC SM: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_sm
- Squeezeformer-CTC XS: https://huggingface.co/theodotus/stt_uk_squeezeformer_ctc_xs
Conformer-CTC
VOSK
- VOSK v3 nano (with dynamic graph): https://drive.google.com/file/d/1Pwlxmtz7SPPm1DThBPM3u66nH6-Dsb1n/view?usp=sharing (73 MB)
- VOSK v3 small (with dynamic graph): https://drive.google.com/file/d/1Zkambkw2hfpLbMmpq2AR04-I7nhyjqtd/view?usp=sharing (133 MB)
- VOSK v3 (with dynamic graph): https://drive.google.com/file/d/12AdVn-EWFwEJXLzNvM0OB-utSNf7nJ4Q/view?usp=sharing (345 MB)
- VOSK v3: https://drive.google.com/file/d/17umTgQuvvWyUiCJXET1OZ3kWNfywPjW2/view?usp=sharing (343 MB)
- VOSK v2: https://drive.google.com/file/d/1MdlN3JWUe8bpCR9A0irEr-Icc1WiPgZs/view?usp=sharing (339 MB, demo code: https://github.com/egorsmkv/vosk-ukrainian-demo)
- VOSK v1: https://drive.google.com/file/d/1nzpXRd4Gtdi0YVxCFYzqtKKtw_tPZQfK/view?usp=sharing (87 MB, an older model trained on less data)
Note: VOSK models are licensed under Apache License 2.0.
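A minimal offline decoding sketch with the `vosk` Python package; the model directory name and `audio.wav` (16 kHz, 16-bit mono PCM) are placeholders for one of the downloads above:

```python
# Minimal sketch: offline decoding with a downloaded VOSK model directory.
import json
import wave

from vosk import KaldiRecognizer, Model

wf = wave.open("audio.wav", "rb")      # placeholder: 16 kHz, 16-bit mono PCM
model = Model("vosk-model-uk-v3")      # placeholder: path to an unpacked model from above
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(4000)
    if len(data) == 0:
        break
    rec.AcceptWaveform(data)

print(json.loads(rec.FinalResult())["text"])
```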
DeepSpeech
- DeepSpeech using transfer learning from English model: https://github.com/robinhad/voice-recognition-ua
- v0.5: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.5 (1230+ hours)
- v0.4: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.4 (1230 hours)
- v0.3: https://github.com/robinhad/voice-recognition-ua/releases/tag/v0.3 (751 hours)
M-CTC-T
- m-ctc-t-large: https://huggingface.co/speechbrain/m-ctc-t-large
whisper
- official whisper: https://github.com/openai/whisper
- whisper (small, fine-tuned for Ukrainian): https://github.com/egorsmkv/whisper-ukrainian
- whisper (large, fine-tuned for Ukrainian): https://huggingface.co/arampacha/whisper-large-uk-2
- whisper (medium, fine-tuned for Ukrainian): https://huggingface.co/mitchelldehaven/whisper-medium-uk
- whisper (large-v2, fine-tuned for Ukrainian): https://huggingface.co/mitchelldehaven/whisper-large-v2-uk
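A minimal sketch with the official `openai-whisper` package; `audio.wav` is a placeholder (the fine-tuned Hugging Face checkpoints above can instead be run through the `transformers` pipeline shown earlier):

```python
# Minimal sketch: transcribe Ukrainian speech with the official whisper package.
import whisper

model = whisper.load_model("small")

# Forcing language="uk" skips automatic language detection.
result = model.transcribe("audio.wav", language="uk")
print(result["text"])
```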
Flashlight
- Flashlight Conformer: https://github.com/egorsmkv/flashlight-ukrainian
Benchmarks
These benchmarks use the Common Voice 10 test split.
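In the tables, Accuracy is simply (1 - WER) * 100%. If you want to reproduce the metrics, WER and CER can be computed with the `jiwer` package; a minimal sketch on a made-up reference/hypothesis pair:

```python
# Minimal sketch: compute WER, CER, and the derived accuracy with jiwer.
import jiwer

reference = "місто в хмельницькій області україни"
hypothesis = "місто у хмельницькій області україни"  # one substituted word

wer = jiwer.wer(reference, hypothesis)
cer = jiwer.cer(reference, hypothesis)
print(f"WER={wer:.4f}  CER={cer:.4f}  Accuracy={100 * (1 - wer):.2f}%")
```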
wav2vec2-bert
| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|-------|-----|-----|-------------|--------|--------|----------------|
| Yehor/w2v-bert-2.0-uk | 0.0727 | 0.0151 | 92.73% | 0.0655 | 0.0139 | 93.45% |
wav2vec2
| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|-------|-----|-----|-------------|--------|--------|----------------|
| Yehor/wav2vec2-xls-r-1b-uk-with-lm | 0.1807 | 0.0317 | 81.93% | 0.1193 | 0.0218 | 88.07% |
| Yehor/wav2vec2-xls-r-1b-uk-with-binary-news-lm | 0.1807 | 0.0317 | 81.93% | 0.0997 | 0.0191 | 90.03% |
| Yehor/wav2vec2-xls-r-300m-uk-with-lm | 0.2906 | 0.0548 | 70.94% | 0.172 | 0.0355 | 82.8% |
| Yehor/wav2vec2-xls-r-300m-uk-with-news-lm | 0.2027 | 0.0365 | 79.73% | 0.0929 | 0.019 | 90.71% |
| Yehor/wav2vec2-xls-r-300m-uk-with-wiki-lm | 0.2027 | 0.0365 | 79.73% | 0.1045 | 0.0208 | 89.55% |
| Yehor/wav2vec2-xls-r-base-uk-with-small-lm | 0.4441 | 0.0975 | 55.59% | 0.2878 | 0.0711 | 71.22% |
| robinhad/wav2vec2-xls-r-300m-uk | 0.2736 | 0.0537 | 72.64% | - | - | - |
| arampacha/wav2vec2-xls-r-1b-uk | 0.1652 | 0.0293 | 83.48% | 0.0945 | 0.0175 | 90.55% |
Citrinet
lm-4gram-500k is used as the LM
| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|-------|-----|-----|-------------|--------|--------|----------------|
| nvidia/stt_uk_citrinet_1024_gamma_0_25 | 0.0432 | 0.0094 | 95.68% | 0.0352 | 0.0079 | 96.48% |
| neongeckocom/stt_uk_citrinet_512_gamma_0_25 | 0.0746 | 0.016 | 92.54% | 0.0563 | 0.0128 | 94.37% |
ContextNet
| Model | WER | CER | Accuracy, % |
|-------|-----|-----|-------------|
| theodotus/stt_uk_contextnet_512 | 0.0669 | 0.0145 | 93.31% |
FastConformer P&C
This model supports text punctuation and capitalization
| Model | WER | CER | Accuracy, % | WER+P&C | CER+P&C | Accuracy+P&C, % |
|-------|-----|-----|-------------|---------|---------|-----------------|
| theodotus/stt_ua_fastconformer_hybrid_large_pc | 0.0400 | 0.0102 | 96.00% | 0.0710 | 0.0167 | 92.90% |
Squeezeformer
lm-4gram-500k is used as the LM
| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|-------|-----|-----|-------------|--------|--------|----------------|
| theodotus/stt_uk_squeezeformer_ctc_xs | 0.1078 | 0.0229 | 89.22% | 0.0777 | 0.0174 | 92.23% |
| theodotus/stt_uk_squeezeformer_ctc_sm | 0.082 | 0.0175 | 91.8% | 0.0605 | 0.0142 | 93.95% |
| theodotus/stt_uk_squeezeformer_ctc_ml | 0.0591 | 0.0126 | 94.09% | 0.0451 | 0.0105 | 95.49% |
Flashlight
lm-4gram-500k is used as the LM
| Model | WER | CER | Accuracy, % | WER+LM | CER+LM | Accuracy+LM, % |
|-------|-----|-----|-------------|--------|--------|----------------|
| Flashlight Conformer | 0.1915 | 0.0244 | 80.85% | 0.0907 | 0.0198 | 90.93% |
data2vec
| Model | WER | CER | Accuracy, % |
|-------|-----|-----|-------------|
| robinhad/data2vec-large-uk | 0.3117 | 0.0731 | 68.83% |
VOSK
| Model | WER | CER | Accuracy, % |
|-------|-----|-----|-------------|
| v3 | 0.5325 | 0.3878 | 46.75% |
m-ctc-t
| Model | WER | CER | Accuracy, % |
|-------|-----|-----|-------------|
| speechbrain/m-ctc-t-large | 0.57 | 0.1094 | 43% |
whisper
| Model | WER | CER | Accuracy, % |
|-------|-----|-----|-------------|
| tiny | 0.6308 | 0.1859 | 36.92% |
| base | 0.521 | 0.1408 | 47.9% |
| small | 0.3057 | 0.0764 | 69.43% |
| medium | 0.1873 | 0.044 | 81.27% |
| large (v1) | 0.1642 | 0.0393 | 83.58% |
| large (v2) | 0.1372 | 0.0318 | 86.28% |
Fine-tuned versions for Ukrainian:
| Model | WER | CER | Accuracy, % |
|-------|-----|-----|-------------|
| small | 0.2704 | 0.0565 | 72.96% |
| large | 0.2482 | 0.055 | 75.18% |
If you want to fine-tune a Whisper model on your own data, use this repository: https://github.com/egorsmkv/whisper-ukrainian
DeepSpeech
| Model | WER | CER | Accuracy, % |
|-------|-----|-----|-------------|
| v0.5 | 0.7025 | 0.2009 | 29.75% |
Development
- How to train your own model using Kaldi (in Russian): https://github.com/egorsmkv/speech-recognition-uk/blob/master/vosk-model-creation/INSTRUCTION.md
- How to train a KenLM language model on Ukrainian Wikipedia data (see the decoding sketch after this list): https://github.com/egorsmkv/ukwiki-kenlm
- Export a traced JIT version of wav2vec2 models: https://github.com/egorsmkv/wav2vec2-jit
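The "+LM" columns in the benchmarks above come from n-gram rescoring of CTC outputs. The "-with-lm" checkpoints already bundle a decoder, but a custom KenLM model (for example, one built with the ukwiki-kenlm recipe) can be wired in with `pyctcdecode`. A minimal sketch, assuming `pyctcdecode` and `kenlm` are installed; `lm.binary` and the random `logits` array are placeholders:

```python
# Minimal sketch: decode CTC logits with an n-gram language model via pyctcdecode.
import numpy as np
from pyctcdecode import build_ctcdecoder
from transformers import Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("robinhad/wav2vec2-xls-r-300m-uk")
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
labels = [" " if token == "|" else token for token in labels]  # word delimiter -> space

# "lm.binary" is a placeholder KenLM model, e.g. built with the ukwiki-kenlm recipe above.
decoder = build_ctcdecoder(labels=labels, kenlm_model_path="lm.binary")

# Stand-in for real acoustic-model output of shape (time, vocab_size).
logits = np.log(np.random.dirichlet(np.ones(len(labels)), size=100)).astype(np.float32)
print(decoder.decode(logits))
```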
Datasets
Compiled dataset from different open sources + companies + community = 188.31 GB / ~1200 hours
- Storage Share powered by Nextcloud: https://nx16725.your-storageshare.de/s/cAbcBeXtdz7znDN (use wget to download; downloading in a browser has speed limitations)
- Torrent file: https://academictorrents.com/details/fcf8bb60c59e9eb583df003d54ed61776650beb8 (188.31 GB)
Voice of America (398 hours)
- Storage Share powered by Nextcloud: https://nx16725.your-storageshare.de/s/f4NYHXdEw2ykZKa
FLEURS
- Ukrainian subset: https://huggingface.co/datasets/google/fleurs/viewer/uk_ua/train
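A minimal sketch of pulling the Ukrainian subset with the `datasets` library (streaming mode, so the full archive is not downloaded):

```python
# Minimal sketch: stream the Ukrainian FLEURS training split.
from datasets import load_dataset

fleurs_uk = load_dataset("google/fleurs", "uk_ua", split="train", streaming=True)
sample = next(iter(fleurs_uk))
print(sample["transcription"])           # ground-truth text
print(sample["audio"]["sampling_rate"])  # decoded audio metadata
```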
YODAS2
- Ukrainian subsets:
- https://huggingface.co/datasets/espnet/yodas2/tree/main/data/uk000
- https://huggingface.co/datasets/espnet/yodas2/tree/main/data/uk100
Companies
- Mozilla Common Voice has the Ukrainian dataset: https://commonvoice.mozilla.org/uk/datasets
- M-AILABS Ukrainian Corpus: http://www.caito.de/data/Training/stt_tts/uk_UK.tgz
- Espreso TV subset: https://blog.gdeltproject.org/visual-explorer-quick-workflow-for-downloading-belarusian-russian-ukrainian-transcripts-translations/
Ukrainian podcasts
- https://huggingface.co/datasets/taras-sereda/uk-pods
Cleaned Common Voice 10 (test set)
- Repository: https://github.com/egorsmkv/cv10-uk-testset-clean
Noised Common Voice 10
- Transcriptions: https://www.dropbox.com/s/ohj3y2cq8f4207a/transcriptions.zip?dl=0
- Audio files: https://www.dropbox.com/s/v8crgclt9opbrv1/data.zip?dl=0
Community
- VoxForge Repository: http://www.repository.voxforge1.org/downloads/uk/Trunk/
Other
- ASR Corpus created using a Telegram bot for Ukrainian: https://github.com/egorsmkv/asr-tg-bot-corpus
- Speech Dataset with Ukrainian: https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/
Related works
Language models
- Ukrainian LMs: https://huggingface.co/Yehor/kenlm-ukrainian
Inverse Text Normalization
- WFST for Ukrainian Inverse Text Normalization: https://github.com/lociko/ukraine_itn_wfst
Text Enhancement
- Punctuation and capitalization model: https://huggingface.co/dchaplinsky/punctuation_uk_bert (demo: https://huggingface.co/spaces/Yehor/punctuation-uk)
Aligners
- Aligner for wav2vec2-bert models: https://github.com/egorsmkv/w2v2-bert-aligner
- Aligner based on FasterWhisper (mostly for TTS): https://github.com/patriotyk/narizaka
- Aligner based on Kaldi: https://github.com/proger/uk
Text-to-Speech
Test sentence with stresses:
К+ам'ян+ець-Под+ільський - м+істо в Хмельн+ицькій +області Укра+їни, ц+ентр Кам'ян+ець-Под+ільської міськ+ої об'+єднаної територі+альної гром+ади +і Кам'ян+ець-Под+ільського рай+ону.
Without stresses:
Кам'янець-Подільський - місто в Хмельницькій області України, центр Кам'янець-Подільської міської об'єднаної територіальної громади і Кам'янець-Подільського району.
Implementations
StyleTTS2
- StyleTTS2 demo & the code
P-Flow TTS
- Demo video: audio.mp4
RAD-TTS
- RAD-TTS, the voice "Lada"
- RAD-TTS with three voices: Lada, Tetiana, and Mykyta
- Demo video: demo.mp4
Coqui TTS
- v1.0.0 using the M-AILABS dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v1.0.0 (200,000 steps)
- v2.0.0 using the Mykyta/Olena dataset: https://github.com/robinhad/ukrainian-tts/releases/tag/v2.0.0 (140,000 steps)
- Demo video: tts_output.mp4
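The releases above ship a checkpoint and config that can be loaded through the Coqui TTS Python API. A minimal sketch, where the file names are placeholders for the files in the release archive (multi-speaker checkpoints also take a `speaker` argument):

```python
# Minimal sketch: synthesize speech with a downloaded Coqui TTS checkpoint.
from TTS.api import TTS

# "model.pth" and "config.json" are placeholders for the files from a release above.
tts = TTS(model_path="model.pth", config_path="config.json")
tts.tts_to_file(
    text="Кам'янець-Подільський - місто в Хмельницькій області України.",
    file_path="output.wav",
)
```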
Neon TTS
- Coqui TTS model implemented in the Neon Coqui TTS Python Plugin. An interactive demo is available on Hugging Face; this model and others can be downloaded from Hugging Face, and more information can be found at neon.ai.
- Demo video: neon_tts.mp4
FastPitch
- NVIDIA FastPitch: https://huggingface.co/theodotus/tts_uk_fastpitch
Balacoon TTS
- Balacoon TTS, voices of Lada, Tetiana, and Mykyta. A blog post accompanies the model release.
- Demo video: balacoon_tts.mp4
Datasets
- Open Text-to-Speech voices for 🇺🇦 Ukrainian: https://huggingface.co/datasets/Yehor/opentts-uk
- Voice "LADA", female
- Voice "TETIANA", female
- Voice "KATERYNA", female
- Voice "MYKYTA", male
- Voice "OLEKSA", male
Related works
Accentors
- https://github.com/NeonBohdan/ukrainian-accentor-transformer
- https://github.com/lang-uk/ukrainian-word-stress
- https://github.com/egorsmkv/ukrainian-accentor
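Stress marks like those in the test sentence above are usually added before synthesis. A minimal sketch with the lang-uk/ukrainian-word-stress package; the `Stressifier` class follows that project's README and should be treated as an assumption if your installed version differs:

```python
# Minimal sketch: add stress marks to text before feeding it to a TTS model.
# The Stressifier API follows the lang-uk/ukrainian-word-stress README (assumption).
from ukrainian_word_stress import Stressifier

stressify = Stressifier()
print(stressify("Кам'янець-Подільський - місто в Хмельницькій області України."))
```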
Misc
- Tool to make a high-quality text-to-speech (TTS) corpus from audio + text books: https://github.com/patriotyk/narizaka
- A model for text normalization (verbalization): https://huggingface.co/skypro1111/mbart-large-50-verbalization