PyTorch Implementation of GenerSpeech (NeurIPS'22): a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
We provide our implementation and pretrained models in this repository.
Visit our demo page for audio samples.
We provide an example of how you can generate high-fidelity samples using GenerSpeech.
To try it on your own dataset, clone this repository to a machine with an NVIDIA GPU and CUDA/cuDNN installed, then follow the instructions below.
You can use the pretrained models we provide here, and the data here. Details of each model are as follows:
| Model | Dataset (16 kHz) | Description |
|---|---|---|
| GenerSpeech | LibriTTS, ESD | Acoustic model (config) |
| HiFi-GAN | LibriTTS, ESD | Neural vocoder |
| Encoder | / | Emotion Encoder |
More supported datasets are coming soon.
A suitable conda environment named generspeech can be created
and activated with:
```shell
conda env create -f environment.yaml
conda activate generspeech
```
By default, this implementation uses as many GPUs in parallel as returned by torch.cuda.device_count().
You can specify which GPUs to use by setting the CUDA_VISIBLE_DEVICES environment variable before running the training module.
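As a minimal sketch of how GPU visibility works (the `0,1` indices are an assumption; adjust to your machine), CUDA only enumerates the devices listed in the variable, so `torch.cuda.device_count()` returns the number of entries:

```python
import os

# Restrict the process to two GPUs (hypothetical indices 0 and 1).
# This must be set before CUDA is initialized, i.e. before importing torch.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

# With this setting, torch.cuda.device_count() would see 2 devices.
visible = os.environ["CUDA_VISIBLE_DEVICES"].split(",")
print(len(visible))  # -> 2
```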
Here we provide a speech synthesis pipeline using GenerSpeech.
Prepare the following checkpoints and data:

- checkpoints/GenerSpeech (acoustic model)
- checkpoints/trainset_hifigan (vocoder)
- checkpoints/Emotion_encoder.pt (emotion encoder)
- data/binary/training_set (binarized data)

Then run inference:

```shell
CUDA_VISIBLE_DEVICES=$GPU python inference/GenerSpeech.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --hparams="text='here we go',ref_audio='assets/0011_001570.wav'"
```

Generated wav files are saved in infer_out by default.
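The `--hparams` override in the inference command is a comma-separated list of `key=value` pairs. A hypothetical helper (`build_hparams` is our own name, not part of the repo) makes the format explicit:

```python
# Hypothetical helper: compose the --hparams override string
# (comma-separated key='value' pairs) used by the inference command.
def build_hparams(text: str, ref_audio: str) -> str:
    return f"text='{text}',ref_audio='{ref_audio}'"

print(build_hparams("here we go", "assets/0011_001570.wav"))
# -> text='here we go',ref_audio='assets/0011_001570.wav'
```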
To prepare your own dataset:

1. Set raw_data_dir, processed_data_dir, and binary_data_dir in the config file, and download the dataset to raw_data_dir.
2. Set preprocess_cls in the config file. The dataset structure needs to follow the processor preprocess_cls, or you can rewrite the processor according to your dataset. We provide a LibriTTS processor as an example in modules/GenerSpeech/config/generspeech.yaml.
3. Set emotion_encoder_path. For more details, please refer to this branch.

Then run the data-processing pipeline:

```shell
# Preprocess step: unify the file structure.
python data_gen/tts/bin/preprocess.py --config $path/to/config

# Align step: MFA alignment.
python data_gen/tts/bin/train_mfa_align.py --config $path/to/config

# Binarization step: binarize data for fast IO.
CUDA_VISIBLE_DEVICES=$GPU python data_gen/tts/bin/binarize.py --config $path/to/config
```

You could also build a dataset via NATSpeech, which shares a common MFA data-processing procedure. We also provide our processed dataset (16 kHz LibriTTS + ESD).
Training:

```shell
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --reset
```

Inference:

```shell
CUDA_VISIBLE_DEVICES=$GPU python tasks/run.py --config modules/GenerSpeech/config/generspeech.yaml --exp_name GenerSpeech --infer
```

This implementation uses parts of the code from the following GitHub repos: FastDiff, NATSpeech, as described in our code.
If you find this code useful in your research, please cite our work:
```
@inproceedings{huanggenerspeech,
  title={GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech},
  author={Huang, Rongjie and Ren, Yi and Liu, Jinglin and Cui, Chenye and Zhao, Zhou},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}
```

Any organization or individual is prohibited from using any technology mentioned in this paper to generate someone's speech without their consent, including but not limited to government leaders, political figures, and celebrities. If you do not comply with this item, you could be in violation of copyright laws.