Belarusian NLP and Speech Processing resources
This repository contains links to Belarusian Natural Language and Speech Processing resources and datasets.
It is inspired by a similar project for Ukrainian speech processing resources: egorsmkv/speech-recognition-uk
TODOs:
- add a detailed description to each list item
- evaluate models on benchmarks and log their performance
Speech-to-Text
Implementations
Benchmarks
Model comparisons grouped by dataset. TODO
Datasets
- Common Voice. Speech recognition dataset
- Dataset from knihi.com. TODO: determine the dataset type
- google/fleurs
- ssrlab: TODO. Speech recognition dataset
Text-to-Speech
Implementations
- CoquiAI implementations
- jhlfrfufyfn/bel-tts. GlowTTS + HifiGan
- Code
- Model
- Demo on HuggingFace
- Demo on a custom web-page. The source code for the demo page: here
- alex73/belarusian-tts. CoquiAI implementation by Yurii Paniv (@robinhad).
The original repo and models were deleted; only a fork is available now.
NLP
POS-tagging
- KoichiYasuoka/roberta-small-belarusian-upos
- stanfordnlp/stanza-be
- poritski/YABC_Tagger. Rule-based POS-tagger and lemmatizer.
Written in Perl.
Possibly uses poritski/YABC as its grammar base (unverified).
- volchek/beltagger.
An improved version of the rule-based POS-tagger and lemmatizer poritski/YABC_Tagger.
Cross-platform, written in C++.
Known issues:
- requires input data to be encoded in Windows-1251; UTF-8 is not supported
- the tagset is not fully compatible with BNKorpus's tagset and grammar base
- the grammar base used is incomplete. Belarus/GrammarDB is a better source of paradigms but is not incorporated yet
- the suffix table calculation script has not been ported from Perl to C++
- the code depends on the Boost library
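Since the tagger expects Windows-1251 input, UTF-8 text has to be re-encoded before tagging and the output decoded afterwards. Below is a minimal sketch of that round trip using only the Python standard library. The function names `utf8_to_cp1251` and `cp1251_to_utf8` are hypothetical helpers, not part of volchek/beltagger, and the sketch does not invoke the tagger itself because its command-line interface is not described here.

```python
from pathlib import Path


def utf8_to_cp1251(src: Path, dst: Path) -> None:
    """Re-encode a UTF-8 text file as Windows-1251 (hypothetical helper)."""
    text = src.read_bytes().decode("utf-8")
    # Strict encoding: raises UnicodeEncodeError if the text contains a
    # character that has no Windows-1251 representation.
    dst.write_bytes(text.encode("cp1251"))


def cp1251_to_utf8(src: Path, dst: Path) -> None:
    """Decode a Windows-1251 file (e.g. tagger output) back to UTF-8."""
    text = src.read_bytes().decode("cp1251")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        original = Path(tmp) / "input_utf8.txt"
        converted = Path(tmp) / "input_cp1251.txt"
        restored = Path(tmp) / "restored_utf8.txt"

        original.write_bytes("Ўсе людзі нараджаюцца свабоднымі.\n".encode("utf-8"))
        utf8_to_cp1251(original, converted)   # file to feed to the tagger
        cp1251_to_utf8(converted, restored)   # same step applies to tagger output
        print(restored.read_bytes().decode("utf-8"), end="")
```

Working on raw bytes avoids platform-specific newline translation. Strict error handling is a deliberate choice in this sketch: characters outside Windows-1251 fail loudly instead of being silently replaced, so problematic input can be normalized before tagging.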
Other
- pkasila/bel-sklony - web page with Belarusian noun declensions. Demo: sklony.pkasila.net
Masked Language Modeling
- KoichiYasuoka/roberta-small-belarusian
Datasets
- oscar
- mc4
- poritski/YABC - Experimental Corpus of the Belarusian Language (ЭКБМ)
- Belarus/GrammarDB - Grammar Database of the Belarusian language
- tsimafeip/Translator - Dataset with Russian-Belarusian translation pairs
- Universal Dependencies dataset:
- Tatoeba Belarusian sentences
Communities and platforms:
- corpus.by
- ssrlab.by
- bnkorpus.info
- Belarus organization on GitHub
- nlproc.by community on GitHub
Unsorted