PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
We provide our implementation and pretrained models in this repository.
Visit our demo page for audio samples.
We provide an example of how you can generate high-fidelity samples using GenerSpeech.
To try it on your own dataset, clone this repository to a machine with an NVIDIA GPU and CUDA/cuDNN installed, then follow the instructions below.
You can use the pretrained models we provide here, and the data here. Details of each model are as follows:
| Model | Dataset (16 kHz) | Description |
|---|---|---|
| GenerSpeech | LibriTTS, ESD | Acoustic model (config) |
| HiFi-GAN | LibriTTS, ESD | Neural vocoder |
| Encoder | / | Emotion Encoder |
More supported datasets are coming soon.
A suitable conda environment named generspeech can be created
and activated with:
```shell
conda env create -f environment.yaml
conda activate generspeech
```
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count().
You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
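As a minimal sketch of how GPU visibility works (the `0,1` indices are an assumption; adjust to your machine), CUDA only enumerates the devices listed in the variable, so `torch.cuda.device_count()` returns the number of entries:

```python
import os

# Restrict the process to two GPUs (hypothetical indices 0 and 1).
# This must be set before CUDA is initialized, i.e. before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# With this setting, torch.cuda.device_count() would see 2 devices.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # -> 2
```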
Here we provide a speech synthesis pipeline using GenerSpeech.
Prepare the following checkpoints and data:

- checkpoints/GenerSpeech (acoustic model)
- checkpoints/trainset_hifigan (vocoder)
- checkpoints/Emotion_encoder.pt (emotion encoder)
- data/binary/training_set (binarized data)

Then run inference:

```shell
CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --hparams="text='here we go',ref_audio='assets/0011_001570.wav'"
```

Generated wav files are saved in infer_out by default.
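The `--hparams` override in the inference command is a comma-separated list of `key=value` pairs. A hypothetical helper (`build_hparams` is our own name, not part of the repo) makes the format explicit:

```python
# Hypothetical helper: compose the --hparams override string
# (comma-separated key='value' pairs) used by the inference command.
def build_hparams(text: str, ref_audio: str) -> str:
    return f"text='{text}',ref_audio='{ref_audio}'"

print(build_hparams("here we go", "assets/0011_001570.wav"))
# -> text='here we go',ref_audio='assets/0011_001570.wav'
```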
To prepare your own dataset:

1. Set raw_data_dir, processed_data_dir, and binary_data_dir in the config file, and download the dataset to raw_data_dir.
2. Set preprocess_cls in the config file. The dataset structure needs to follow the processor preprocess_cls, or you can rewrite the processor according to your dataset. We provide a LibriTTS processor as an example in modules/GenerSpeech/config/generspeech.yaml.
3. Set emotion_encoder_path. For more details, please refer to this branch.

Then run the data-processing pipeline:

```shell
# Preprocess step: unify the file structure.
python data_gen/tts/bin/preprocess.py --config $path/to/config

# Align step: MFA alignment.
python data_gen/tts/bin/train_mfa_align.py --config $path/to/config

# Binarization step: binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
```

You could also build a dataset via NATSpeech, which shares a common MFA data-processing procedure. We also provide our processed dataset (16 kHz LibriTTS + ESD).
Training:

```shell
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset
```

Inference:

```shell
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --infer
```

This implementation uses parts of the code from the following GitHub repos: FastDiff, NATSpeech, as described in our code.
If you find this code useful in your research, please cite our work:
```
@inproceedings{huanggenerspeech,
  title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech},
  author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.