Implementation of "Duration Informed Attention Network for Multimodal Synthesis" (https://arxiv.org/pdf/1909.01700.pdf) paper.
Status: released
DurIAN is encoder-decoder architecture for text-to-speech synthesis task. Unlike prior architectures like Tacotron 2 it doesn't learn attention mechanism but takes into account phoneme durations information. So, of course, to use this model one should have phonemized and duration-aligned dataset. However, you may try to use pretrained duration model on LJSpeech dataset (CMU dict used). Links will be provided below.
DurIAN model consists of two modules: backbone synthesizer and duration predictor. Here are some of the most notable differences from DurIAN described in paper:
Both backbone synthesizer and duration model are trained simultaneously. For implementation simplifications duration model predicts alignment over the fixed max number of frames. You can learn this outputs as BCE problem, MSE problem by summing over frames-axis or to use both losses (haven't tested this one), set it in the config.json. Experiments showed that just-BCE version of optimization process showed itself being unstable with longer text sequences, so prefer using MSE+BCE or just-MSE (don't mind if you get bad alignments in Tensorboard).
You can check the synthesis demo wavfile (was obtained much before convergence) in demo folder (used Waveglow vocoder).
First of all, make sure you have installed all packages using pip install --upgrade -r requirements.txt. The code is tested using pytorch==1.5.0
Clone the repository: git clone https://github.com/ivanvovk/DurrIAN
To start training paper-based DurIAN version run python train.py -c configs/default.json. You can specify to train baseline model as python train.py -c configs/baseline.json --baseline
To make sure that everything works fine at your local environment you may run unit tests in tests folder by python <test_you_want_to_run.py>.
This implementation was trained using phonemized duration-aligned LJSpeech dataset with BCE duration loss minimization. You may find it via this link.
The main drawback of this model is requiring of duration-aligned dataset. You can find parsed LJSpeech filelist used in the training of current implementation in filelists folder. In order to use your data, make sure you have organized your filelists in the same way as provided LJSpeech ones. However, in order to save time and neurons of your brains you may try to train the model on your dataset without duration-aligning using the pretrained on LJSpeech duration model from my model checkpoint (didn't tried). But if you are interested in aligning personal dataset, carefully follow the next section.
In my experiments I aligned LJSpeech with Montreal Forced Alignment tool. If here something will be unclear, please, follow instructions in toolkit's docs. To begin with, aligning algorithm has several steps:
Organize your dataset properly. MFA requires it to be in a single folder of structure {utterance_id.lab, utterance_id.wav}. Make sure all your texts are of .lab format.
Download MFA release and follow installation instructions via this link.
Once done with MFA, you need your dataset words dictionary with phonemes transcriptions. Here you have several options:
bin/mfa_generate_dictionary /path/to/model_g2p.zip /path/to/data dict.txt. Notice, that default MFA installation will automatically provide you with English pretrained model, which you may use.Once you have your data prepared, dictionary and G2P model, now you are ready for aligning. Run the command bin/mfa_align /path/to/data dict.txt path/to/model_g2p.zip outdir. Wait until done. outdir folder will contain a list of out of vocabulary words and a folder with special files of .TextGrid format, where wavs alignments are stored.
Now we want to process these text grid files in order to get the final filelist. Here you may find useful the python package TextGrid. Install it using pip install TextGrid. Here an example how to use it:
import textgrid
tg = textgrid.TextGrid.fromFile('./outdir/data/text0.TextGrid')
Now tg is the set two objects: first one contains aligned words, second one contains aligned phonemes. You need the second one. Extract durations (in frames! tg has intervals in seconds, thus convert) for whole dataset by iterating over obtained .TextGrid files and prepare a filelist in same format as the ones I provided in filelists folder.
I found an overview of several aligners. Maybe it will be helpful. However, I recommend you to use MFA as it is one of the most accurate aligners, to my best knowledge.