Simplified Chinese | English
[TOC]
```
.
|--- config/              # configuration files
|    |--- default.yaml
|    |--- ...
|--- datasets/            # data processing
|--- encoder/             # speaker (voice-print) encoder
|    |--- voice_encoder.py
|    |--- ...
|--- helpers/             # helper classes
|    |--- trainer.py
|    |--- synthesizer.py
|    |--- ...
|--- logdir/              # training run output directory
|--- losses/              # loss functions
|--- models/              # synthesis models
|    |--- layers.py
|    |--- duration.py
|    |--- parallel.py
|--- pretrained/          # pre-trained models (LJSpeech dataset)
|--- samples/             # synthesized samples
|--- utils/               # general utilities
|--- vocoder/             # vocoder
|    |--- melgan.py
|    |--- ...
|--- wandb/               # Wandb save directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py   # data preparation script
|--- README.md
|--- README_en.md
|--- requirements.txt     # dependencies
|--- synthesize.py        # synthesis script
|--- train-duration.py    # training scripts
|--- train-parallel.py
```
Some synthesized samples can be found here.
Some pre-trained models are provided here.
Step (1): Clone the repository

```
$ git clone https://github.com/atomicoo/ParallelTTS.git
```

Step (2): Install dependencies

```
$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt
```

Step (3): Synthesize speech

```
$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/
```

If you want to synthesize speech in other languages, specify the corresponding configuration file via `--config`.
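If you have several input text files to synthesize, you can drive `synthesize.py` from a small Python wrapper. This is only a convenience sketch: the loop and the per-file output directories are illustrative, and only the command-line flags already shown above are used.

```python
# Sketch: batch-synthesize several text files by shelling out to synthesize.py.
# Paths other than the README's example file are placeholders.
import subprocess
from pathlib import Path

inputs = [
    "./samples/english/synthesize.txt",  # example file from this README
    # add more input text files here
]

for text_file in inputs:
    out_dir = Path("./outputs") / Path(text_file).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "python", "synthesize.py",
            "--checkpoint", "./pretrained/ljspeech-parallel-epoch0100.pth",
            "--melgan_checkpoint", "./pretrained/ljspeech-melgan-epoch3200.pth",
            "--input_texts", text_file,
            "--outputs_dir", str(out_dir),
        ],
        check=True,  # raise if synthesis fails for any file
    )
```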
Step (1): Prepare the data

```
$ python prepare-dataset.py
```

The configuration file can be specified via `--config`; the default `default.yaml` is for the LJSpeech dataset.
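Before adapting the configuration to your own dataset, it can help to inspect what `config/default.yaml` actually contains. The sketch below only loads the file with PyYAML and lists its top-level sections; it assumes nothing about the specific key names, only that the file parses to a mapping.

```python
# Sketch: inspect a config file before adapting it to a new dataset.
import yaml  # PyYAML

with open("config/default.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Print each top-level section and its immediate keys.
for section, value in cfg.items():
    if isinstance(value, dict):
        print(f"{section}: {list(value.keys())}")
    else:
        print(f"{section}: {value}")
```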
Step (2): Train the alignment model

```
$ python train-duration.py
```

Step (3): Extract durations

```
$ python extract-duration.py
```

`--ground_truth` can be used to specify whether the Ground-Truth spectrograms are generated with the alignment model.
Step (4): Train the synthesis model

```
$ python train-parallel.py
```

`--ground_truth` can be used to specify whether to train the model on Ground-Truth spectrograms.
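For context on why durations are extracted at all: a parallel (non-autoregressive) synthesizer expands each phoneme encoding by its duration in frames before decoding the spectrogram. The sketch below illustrates that length-regulation idea in PyTorch; it is a conceptual example, not the code in `models/parallel.py`.

```python
# Sketch: duration-based length regulation, the core idea behind parallel TTS.
import torch

def length_regulate(encodings: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding according to its integer duration in frames.

    encodings: (num_phonemes, channels)
    durations: (num_phonemes,) non-negative integers
    returns:   (num_frames, channels) with num_frames = durations.sum()
    """
    return torch.repeat_interleave(encodings, durations, dim=0)

phonemes = torch.randn(5, 8)            # 5 phonemes, 8-dim encodings
durs = torch.tensor([3, 1, 4, 2, 5])    # frames per phoneme
frames = length_regulate(phonemes, durs)
print(frames.shape)                     # torch.Size([15, 8])
```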
If you use TensorBoardX, run the following command:

```
$ tensorboard --logdir logdir/[DIR]/
```

Wandb (Weights & Biases) is strongly recommended: just add the `--enable_wandb` option to the training commands above.
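If you want to log additional metrics from your own scripts alongside the built-in logging, both tools expose a small API. A minimal sketch follows; the run name and metric names here are made up and are not the ones used by the training scripts.

```python
# Sketch: logging a scalar with tensorboardX and with wandb.
# Run and metric names are illustrative only.
from tensorboardX import SummaryWriter
import wandb

writer = SummaryWriter(logdir="logdir/my-run")
wandb.init(project="ParallelTTS", name="my-run")

for step in range(100):
    loss = 1.0 / (step + 1)                    # dummy value
    writer.add_scalar("train/loss", loss, step)
    wandb.log({"train/loss": loss}, step=step)

writer.close()
wandb.finish()
```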
TODO: To be supplemented
Training speed: on the LJSpeech dataset, with a batch size of 64, the model can be trained on a single 8 GB GTX 1080 card. After about 8 hours of training (~300 epochs), it can synthesize speech of fairly high quality.
Synthesis speed: the following tests were run on CPU @ Intel Core i7-8550U / GPU @ NVIDIA GeForce MX150; each synthesized utterance is about 8 seconds long (about 20 words), and the times below are in seconds.
| Batch size | Spec (GPU) | Audio (GPU) | Spec (CPU) | Audio (CPU) |
|---|---|---|---|---|
| 1 | 0.042 | 0.218 | 0.100 | 2.004 |
| 2 | 0.046 | 0.453 | 0.209 | 3.922 |
| 4 | 0.053 | 0.863 | 0.407 | 7.897 |
| 8 | 0.062 | 2.386 | 0.878 | 14.599 |
Note that these numbers come from single runs rather than averages over multiple trials, so they are for reference only.
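If you want to reproduce this kind of measurement yourself, the usual pattern is to synchronize the GPU before reading the clock, otherwise asynchronous CUDA kernels make the numbers misleading. A minimal timing sketch, where `model` and `vocoder` are placeholders for whatever callables you are benchmarking:

```python
# Sketch: timing spectrogram and waveform generation.
# `model` and `vocoder` below are placeholders, not names from this repo.
import time
import torch

def timed(fn, *args):
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish any pending GPU work first
    start = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this call to really finish
    return out, time.perf_counter() - start

# Example usage (placeholders):
# spec,  t_spec  = timed(model.synthesize, text_batch)
# audio, t_audio = timed(vocoder, spec)
# print(f"spec: {t_spec:.3f}s  audio: {t_audio:.3f}s")
```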
The vocoder code is taken from ParallelWaveGAN. Because the acoustic feature extraction methods are not compatible, the features must be converted; see here for the specific conversion code.
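The exact conversion depends on how each codebase extracts its mel spectrograms (log base, amplitude vs. power, normalization), so refer to the linked code for the real thing. Purely as an illustration of the kind of adjustment involved, converting between log10 and natural-log mel magnitudes would look like this:

```python
# Sketch: converting between two hypothetical mel-spectrogram conventions.
# This only illustrates the kind of adjustment needed; the actual conversion
# used by this repo is in the code linked above and may differ.
import numpy as np

def log10_to_ln(mel_log10: np.ndarray) -> np.ndarray:
    """Convert log10-magnitude mel frames to natural-log magnitudes."""
    return mel_log10 * np.log(10.0)

def ln_to_log10(mel_ln: np.ndarray) -> np.ndarray:
    """Inverse of the above."""
    return mel_ln / np.log(10.0)
```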