This repository is a phonemic multilingual (Russian-English) implementation based on Real-Time-Voice-Cloning. It is a four-stage deep learning framework that creates a numerical representation of a voice from a few seconds of audio and uses it to condition a text-to-speech model. If you need the English version, please use the original implementation.
Use the Colab Online Demo
You will need the following whether you plan to use the toolbox only or to retrain the models.
Python 3.6 or later.
PyTorch (>= 1.0.1).
Run pip install -r requirements.txt to install the necessary packages.
A GPU is mandatory, but you don't necessarily need a high-tier GPU if you only want to use the toolbox.
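The requirements above can be checked before launching anything. The following is a minimal sketch, not part of the repository: the function names `version_tuple` and `check_requirements` are hypothetical, and the minimum versions are taken from the list above. It only reports what it finds and does not install anything.

```python
import importlib.util
import re
import sys

# Minimum versions taken from the requirements listed in this README.
MIN_PYTHON = (3, 6)
MIN_TORCH = (1, 0, 1)


def version_tuple(version):
    """Turn a string such as '1.13.1+cu117' into (1, 13, 1)."""
    # Drop any local build suffix after '+', then keep up to three numbers.
    numbers = re.findall(r"\d+", version.split("+")[0])
    return tuple(int(n) for n in numbers[:3])


def check_requirements():
    """Return a dict describing whether the listed requirements are met."""
    report = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "torch_ok": False,
        "cuda_available": False,
    }
    # Import torch only if it is installed, so the check never crashes.
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch_ok"] = version_tuple(torch.__version__) >= MIN_TORCH
        report["cuda_available"] = torch.cuda.is_available()
    return report


if __name__ == "__main__":
    for name, ok in check_requirements().items():
        print(f"{name}: {ok}")
```

If `cuda_available` is reported as false, PyTorch cannot see a usable GPU, which conflicts with the GPU requirement stated above.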
Download the latest here.
| Name | Language | Link | Comments | My link | My comments |
|---|---|---|---|---|---|
| Phoneme dictionary | En, Ru | En, Ru | Phoneme dictionary | Link | Combined Russian and English phonemic dictionary |
| LibriSpeech | En | Link | 300 speakers, 360 h of clean speech | | |
| VoxCeleb | En | Link | 7000 speakers, many hours of low-quality speech | | |
| M-AILABS | Ru | Link | 3 speakers, 46 h of clean speech | | |
| Open_TTS, Open_STT | Ru | Open_TTS, Open_STT | Many speakers, many hours of low-quality speech | Link | Cleaned 4 hours of speech from one speaker. Corrected the annotation and divided it into segments of up to 7 seconds |
| VoxForge + audiobook | Ru | Link | Many speakers, 25 h of varying quality | Link | Selected the good files and split them into segments. Added an audiobook from the Internet. The result is 200 speakers with a couple of minutes each |
| RUSLAN | Ru | Link | One speaker, 40 h of good speech | Link | Converted to 16 kHz |
| Mozilla | Ru | Link | 50 speakers, 30 h of good speech | Link | Converted to 16 kHz, sorted the different users into folders |
| Russian Single | Ru | Link | One speaker, 9 h of good speech | Link | Converted to 16 kHz |
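The comments in the table mention dividing recordings into segments of up to 7 seconds at 16 kHz. The README does not say how the split was done, so the following is only an illustrative sketch of plain fixed-length splitting; the function name `split_into_segments` and its parameters are hypothetical, and the author may well have split on pauses instead.

```python
def split_into_segments(num_samples, sample_rate=16000, max_seconds=7.0):
    """Return (start, end) sample index pairs, each at most max_seconds long.

    Assumption: fixed-length cuts; this ignores pauses and word boundaries.
    """
    if num_samples < 0 or sample_rate <= 0 or max_seconds <= 0:
        raise ValueError("num_samples must be >= 0; rate and length must be > 0")
    # Longest allowed segment, in samples (at least one sample).
    max_len = max(1, int(sample_rate * max_seconds))
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


if __name__ == "__main__":
    # A 15-second recording at 16 kHz yields two 7 s segments and one 1 s rest.
    for start, end in split_into_segments(15 * 16000):
        print(start, end, (end - start) / 16000)
```

Cutting at fixed offsets can split a word in half, so a real preprocessing step would normally also align the cuts with the transcript.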
You can then try the toolbox:
python demo_toolbox.py -d <datasets_root>
or
python demo_toolbox.py
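The two commands above differ only in the optional `-d <datasets_root>` argument. As a small sketch, assuming you want to launch the toolbox from another Python script, the argument list can be built as follows; the helper `build_toolbox_command` and the `datasets` path are hypothetical and not part of the repository.

```python
import sys
from pathlib import Path


def build_toolbox_command(datasets_root=None, script="demo_toolbox.py"):
    """Build the argument list for launching the toolbox.

    With datasets_root=None the toolbox is started without the -d option.
    """
    command = [sys.executable, script]
    if datasets_root is not None:
        command += ["-d", str(Path(datasets_root))]
    return command


if __name__ == "__main__":
    # Hypothetical dataset location; replace it with your own <datasets_root>.
    print(build_toolbox_command("datasets"))
    print(build_toolbox_command())
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the repository root to start the toolbox.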
Pretrained models
Training (and for other languages)
For any questions, please email me.
| URL | Designation | Title | Implementation source |
|---|---|---|---|
| 1806.04558 | SV2TTS | Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis | CorentinJ |
| 1802.08435 | WaveRNN (vocoder) | Efficient Neural Audio Synthesis | fatchord/WaveRNN |
| 1712.05884 | Tacotron 2 (synthesizer) | Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions | Rayhane-mamah/Tacotron-2 |
| 1710.10467 | GE2E (encoder) | Generalized End-To-End Loss for Speaker Verification | CorentinJ |