Simplified Chinese | English
[TOC]
```
.
|--- config/              # configuration files
|    |--- default.yaml
|    |--- ...
|--- datasets/            # data processing
|--- encoder/             # speaker (voice-print) encoder
|    |--- voice_encoder.py
|    |--- ...
|--- helpers/             # helper classes
|    |--- trainer.py
|    |--- synthesizer.py
|    |--- ...
|--- logdir/              # training run output directory
|--- losses/              # loss functions
|--- models/              # synthesis models
|    |--- layers.py
|    |--- duration.py
|    |--- parallel.py
|--- pretrained/          # pre-trained models (LJSpeech dataset)
|--- samples/             # synthesized samples
|--- utils/               # general utilities
|--- vocoder/             # vocoder
|    |--- melgan.py
|    |--- ...
|--- wandb/               # Wandb save directory
|--- extract-duration.py
|--- extract-embedding.py
|--- LICENSE
|--- prepare-dataset.py   # data preparation script
|--- README.md
|--- README_en.md
|--- requirements.txt     # dependencies
|--- synthesize.py        # synthesis script
|--- train-duration.py    # training scripts
|--- train-parallel.py
```
Some synthesized samples can be found here.
Some pre-trained models are provided here.
Step (1): Clone the repository

```
$ git clone https://github.com/atomicoo/ParallelTTS.git
```

Step (2): Install dependencies

```
$ conda create -n ParallelTTS python=3.7.9
$ conda activate ParallelTTS
$ pip install -r requirements.txt
```

Step (3): Synthesize speech

```
$ python synthesize.py \
  --checkpoint ./pretrained/ljspeech-parallel-epoch0100.pth \
  --melgan_checkpoint ./pretrained/ljspeech-melgan-epoch3200.pth \
  --input_texts ./samples/english/synthesize.txt \
  --outputs_dir ./outputs/
```

If you want to synthesize speech in other languages, specify the corresponding configuration file via `--config`.
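If you have several input text files to synthesize, you can drive `synthesize.py` from a small Python wrapper. This is only a convenience sketch: the loop and the per-file output directories are illustrative, and only the command-line flags already shown above are used.

```python
# Sketch: batch-synthesize several text files by shelling out to synthesize.py.
# Paths other than the README's example file are placeholders.
import subprocess
from pathlib import Path

inputs = [
    "./samples/english/synthesize.txt",  # example file from this README
    # add more input text files here
]

for text_file in inputs:
    out_dir = Path("./outputs") / Path(text_file).stem
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "python", "synthesize.py",
            "--checkpoint", "./pretrained/ljspeech-parallel-epoch0100.pth",
            "--melgan_checkpoint", "./pretrained/ljspeech-melgan-epoch3200.pth",
            "--input_texts", text_file,
            "--outputs_dir", str(out_dir),
        ],
        check=True,  # raise if synthesis fails for any file
    )
```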
Step (1): Prepare the data

```
$ python prepare-dataset.py
```

The configuration file can be specified via `--config`; the default `default.yaml` is for the LJSpeech dataset.
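Before adapting the configuration to your own dataset, it can help to inspect what `config/default.yaml` actually contains. The sketch below only loads the file with PyYAML and lists its top-level sections; it assumes nothing about the specific key names, only that the file parses to a mapping.

```python
# Sketch: inspect a config file before adapting it to a new dataset.
import yaml  # PyYAML

with open("config/default.yaml", "r", encoding="utf-8") as f:
    cfg = yaml.safe_load(f)

# Print each top-level section and its immediate keys.
for section, value in cfg.items():
    if isinstance(value, dict):
        print(f"{section}: {list(value.keys())}")
    else:
        print(f"{section}: {value}")
```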
Step (2): Train the alignment model

```
$ python train-duration.py
```

Step (3): Extract durations

```
$ python extract-duration.py
```

`--ground_truth` can be used to specify whether the Ground-Truth spectrograms are generated with the alignment model.
Step (4): Train the synthesis model

```
$ python train-parallel.py
```

`--ground_truth` can be used to specify whether to train the model on Ground-Truth spectrograms.
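For context on why durations are extracted at all: a parallel (non-autoregressive) synthesizer expands each phoneme encoding by its duration in frames before decoding the spectrogram. The sketch below illustrates that length-regulation idea in PyTorch; it is a conceptual example, not the code in `models/parallel.py`.

```python
# Sketch: duration-based length regulation, the core idea behind parallel TTS.
import torch

def length_regulate(encodings: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding according to its integer duration in frames.

    encodings: (num_phonemes, channels)
    durations: (num_phonemes,) non-negative integers
    returns:   (num_frames, channels) with num_frames = durations.sum()
    """
    return torch.repeat_interleave(encodings, durations, dim=0)

phonemes = torch.randn(5, 8)            # 5 phonemes, 8-dim encodings
durs = torch.tensor([3, 1, 4, 2, 5])    # frames per phoneme
frames = length_regulate(phonemes, durs)
print(frames.shape)                     # torch.Size([15, 8])
```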
If you use TensorBoardX, run the following command:

```
$ tensorboard --logdir logdir/[DIR]/
```

Wandb (Weights & Biases) is strongly recommended: just add the `--enable_wandb` option to the training commands above.
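If you want to log additional metrics from your own scripts alongside the built-in logging, both tools expose a small API. A minimal sketch follows; the run name and metric names here are made up and are not the ones used by the training scripts.

```python
# Sketch: logging a scalar with tensorboardX and with wandb.
# Run and metric names are illustrative only.
from tensorboardX import SummaryWriter
import wandb

writer = SummaryWriter(logdir="logdir/my-run")
wandb.init(project="ParallelTTS", name="my-run")

for step in range(100):
    loss = 1.0 / (step + 1)                    # dummy value
    writer.add_scalar("train/loss", loss, step)
    wandb.log({"train/loss": loss}, step=step)

writer.close()
wandb.finish()
```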
TODO: To be supplemented
Training speed: on the LJSpeech dataset, with a batch size of 64, the model can be trained on a single 8 GB GTX 1080 card. After about 8 hours of training (~300 epochs), it can synthesize speech of fairly high quality.
Synthesis speed: the following tests were run on CPU @ Intel Core i7-8550U / GPU @ NVIDIA GeForce MX150; each synthesized utterance is about 8 seconds long (about 20 words), and the times below are in seconds.
| Batch size | Spec (GPU) | Audio (GPU) | Spec (CPU) | Audio (CPU) |
|---|---|---|---|---|
| 1 | 0.042 | 0.218 | 0.100 | 2.004 |
| 2 | 0.046 | 0.453 | 0.209 | 3.922 |
| 4 | 0.053 | 0.863 | 0.407 | 7.897 |
| 8 | 0.062 | 2.386 | 0.878 | 14.599 |
Note that these numbers come from single runs rather than averages over multiple trials, so they are for reference only.
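If you want to reproduce this kind of measurement yourself, the usual pattern is to synchronize the GPU before reading the clock, otherwise asynchronous CUDA kernels make the numbers misleading. A minimal timing sketch, where `model` and `vocoder` are placeholders for whatever callables you are benchmarking:

```python
# Sketch: timing spectrogram and waveform generation.
# `model` and `vocoder` below are placeholders, not names from this repo.
import time
import torch

def timed(fn, *args):
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # finish any pending GPU work first
    start = time.perf_counter()
    out = fn(*args)
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # wait for this call to really finish
    return out, time.perf_counter() - start

# Example usage (placeholders):
# spec,  t_spec  = timed(model.synthesize, text_batch)
# audio, t_audio = timed(vocoder, spec)
# print(f"spec: {t_spec:.3f}s  audio: {t_audio:.3f}s")
```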
The vocoder code is taken from ParallelWaveGAN. Because the acoustic feature extraction methods are not compatible, the features must be converted; see here for the specific conversion code.
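The exact conversion depends on how each codebase extracts its mel spectrograms (log base, amplitude vs. power, normalization), so refer to the linked code for the real thing. Purely as an illustration of the kind of adjustment involved, converting between log10 and natural-log mel magnitudes would look like this:

```python
# Sketch: converting between two hypothetical mel-spectrogram conventions.
# This only illustrates the kind of adjustment needed; the actual conversion
# used by this repo is in the code linked above and may differ.
import numpy as np

def log10_to_ln(mel_log10: np.ndarray) -> np.ndarray:
    """Convert log10-magnitude mel frames to natural-log magnitudes."""
    return mel_log10 * np.log(10.0)

def ln_to_log10(mel_ln: np.ndarray) -> np.ndarray:
    """Inverse of the above."""
    return mel_ln / np.log(10.0)
```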