This is the official code implementation of Matcha-TTS [ICASSP 2024].
We propose Matcha-TTS, a new approach to non-autoregressive neural TTS that uses conditional flow matching (similar to rectified flows) to speed up ODE-based speech synthesis.
Check out our demo page and read our ICASSP 2024 paper for more details.
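As a rough illustration of why straight-line paths suit ODE-based sampling, the toy sketch below integrates an idealised vector field with a fixed-step Euler solver. It is not the Matcha-TTS decoder: the learned network is replaced by a hand-written field that points straight at a known target, and the names `ideal_vector_field` and `euler_sample` are hypothetical.

```python
def ideal_vector_field(x, t, target):
    """Toy stand-in for a learned vector field (an assumption, not the model).

    Along the straight path x_t = (1 - t) * x0 + t * target, the velocity
    equals (target - x_t) / (1 - t), which is constant along the path.
    """
    return [(x1 - xi) / (1.0 - t) for xi, x1 in zip(x, target)]


def euler_sample(x0, target, n_steps):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with n_steps Euler steps."""
    x = list(x0)
    h = 1.0 / n_steps
    for k in range(n_steps):
        t = k / n_steps  # always below 1, so the field is well defined
        v = ideal_vector_field(x, t, target)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2, 0.8]      # stands in for a sample from the prior
target = [-5.0, -4.2, -6.1]   # stands in for mel-spectrogram values
for n in (1, 2, 10):
    print(n, euler_sample(noise, target, n))
```

Because this toy path is exactly straight, every step count lands on the target up to rounding. A learned field is generally only approximately straight, so the number of solver steps still matters in practice.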
Pre-trained models will be automatically downloaded with the CLI or gradio interface.
You can also try Matcha-TTS in your browser on HuggingFace Spaces.
conda create -n matcha-tts python=3.10 -y
conda activate matcha-tts
pip install matcha-tts
From source:
pip install git+https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
pip install -e .
# This will download the required models
matcha-tts --text "<INPUT TEXT>"
or
matcha-tts-app
or open synthesis.ipynb on Jupyter Notebook
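To drive the CLI from a script, one option is to assemble the argument list in Python and hand it to `subprocess`. The sketch below is a minimal example under that assumption; `build_matcha_command` is a hypothetical helper, and it only uses flags that appear in this README (`--text`, `--speaking_rate`, `--temperature`, `--steps`).

```python
import shlex


def build_matcha_command(text, speaking_rate=None, temperature=None, steps=None):
    """Build an argument list for the matcha-tts CLI (hypothetical helper)."""
    cmd = ["matcha-tts", "--text", text]
    # Only add the optional flags that were explicitly requested.
    if speaking_rate is not None:
        cmd += ["--speaking_rate", str(speaking_rate)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if steps is not None:
        cmd += ["--steps", str(steps)]
    return cmd


cmd = build_matcha_command("Hello there", temperature=0.667, steps=10)
print(shlex.join(cmd))  # shell-quoted form, handy for logging
# To actually run it (requires matcha-tts to be installed and on PATH):
# import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting problems when the input text contains spaces or quotes.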
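# Synthesise from a text string
matcha-tts --text "<INPUT TEXT>"
# Synthesise from a file
matcha-tts --file <PATH TO FILE>
# Batch synthesis from a file
matcha-tts --file <PATH TO FILE> --batched
Additional arguments
# Speaking rate
matcha-tts --text "<INPUT TEXT>" --speaking_rate 1.0
# Sampling temperature
matcha-tts --text "<INPUT TEXT>" --temperature 0.667
# Number of ODE solver steps
matcha-tts --text "<INPUT TEXT>" --steps 10
Let's assume we are training with LJ Speech.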
Download the LJ Speech dataset, extract it to data/LJSpeech-1.1, and prepare the file lists to point to the extracted data, as in item 5 of the setup in the NVIDIA Tacotron 2 repo.
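Preparing the file lists amounts to making each audio path point at the extracted data. The sketch below shows one way to do that in Python, assuming the file lists use `path|transcript` lines with a `DUMMY` placeholder directory, as in the Tacotron 2 repo. The helper name `point_filelist_at_dataset` and the example line are illustrative only.

```python
def point_filelist_at_dataset(lines, wav_dir, placeholder="DUMMY"):
    """Replace a placeholder directory in `path|text` filelist lines.

    Assumes each line looks like `DUMMY/LJ001-0001.wav|some transcript`.
    Lines that do not start with the placeholder are left unchanged.
    """
    fixed = []
    for line in lines:
        path, sep, text = line.partition("|")
        if path.startswith(placeholder + "/"):
            path = wav_dir.rstrip("/") + path[len(placeholder):]
        fixed.append(path + sep + text)
    return fixed


example = ["DUMMY/LJ001-0001.wav|Printing, in the only sense"]
print(point_filelist_at_dataset(example, "data/LJSpeech-1.1/wavs"))
```

If your file lists already contain paths under data/LJSpeech-1.1, no rewriting is needed.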
Clone and enter the Matcha-TTS repository
git clone https://github.com/shivammehta25/Matcha-TTS.git
cd Matcha-TTS
Install the package from source
pip install -e .
Go to configs/data/ljspeech.yaml and change
train_filelist_path: data/filelists/ljs_audio_text_train_filelist.txt
valid_filelist_path: data/filelists/ljs_audio_text_val_filelist.txt
to the paths of your train and validation filelists.
Generate normalisation statistics from the dataset configuration file
matcha-data-stats -i ljspeech.yaml
# Output:
# {'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}
Update these values in configs/data/ljspeech.yaml under the data_statistics key.
data_statistics: # Computed for ljspeech dataset
  mel_mean: -5.536622
  mel_std: 2.116101
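Copying the statistics into the config by hand is easy to get wrong, so a small script can format them. The sketch below assumes matcha-data-stats prints a single Python dict literal, as in the output shown in this README; `stats_to_yaml` is a hypothetical helper, not part of the package.

```python
import ast


def stats_to_yaml(output_line, digits=6):
    """Turn the printed statistics dict into a data_statistics YAML fragment."""
    stats = ast.literal_eval(output_line)  # safely parses the dict literal
    lines = ["data_statistics:"]
    for key in ("mel_mean", "mel_std"):
        lines.append(f"  {key}: {stats[key]:.{digits}f}")
    return "\n".join(lines)


printed = "{'mel_mean': -5.53662231756592, 'mel_std': 2.1161014277038574}"
print(stats_to_yaml(printed))
```

The printed fragment can then be pasted into configs/data/ljspeech.yaml in place of the existing data_statistics entry.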
make train-ljspeech
or
python matcha/train.py experiment=ljspeech
For a minimum-memory run
python matcha/train.py experiment=ljspeech_min_memory
For multi-GPU training
python matcha/train.py experiment=ljspeech trainer.devices=[0,1]
Synthesise from the custom trained model
matcha-tts --text "<INPUT TEXT>" --checkpoint_path <PATH TO CHECKPOINT>
Special thanks to @mush42 for implementing ONNX export and inference support.
It is possible to export Matcha checkpoints to ONNX, and run inference on the exported ONNX graph.
To export a checkpoint to ONNX, first install ONNX with
pip install onnx
then run the following:
python3 -m matcha.onnx.export matcha.ckpt model.onnx --n-timesteps 5
Optionally, the ONNX exporter accepts vocoder-name and vocoder-checkpoint arguments. This enables you to embed the vocoder in the exported graph and generate waveforms in a single run (similar to end-to-end TTS systems).
Note that n_timesteps is treated as a hyper-parameter rather than a model input. This means you should specify it during export (not during inference). If not specified, n_timesteps is set to 5.
Important: for now, torch>=2.1.0 is needed for export since the scaled_dot_product_attention operator is not exportable in older versions. Until the final version is released, those who want to export their models must install torch>=2.1.0 manually as a pre-release.
To run inference on the exported model, first install onnxruntime using
pip install onnxruntime
pip install onnxruntime-gpu # for GPU inference
then use the following:
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs
You can also control synthesis parameters:
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --temperature 0.4 --speaking_rate 0.9 --spk 0
To run inference on GPU, make sure to install the onnxruntime-gpu package, and then pass --gpu to the inference command:
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --gpu
If you exported only Matcha to ONNX, this will write mel-spectrograms as graphs and NumPy arrays to the output directory.
If you embedded the vocoder in the exported graph, this will write .wav audio files to the output directory.
If you exported only Matcha to ONNX, and you want to run a full TTS pipeline, you can pass a path to a vocoder model in ONNX format:
python3 -m matcha.onnx.infer model.onnx --text "hey" --output-dir ./outputs --vocoder hifigan.small.onnx
This will write .wav audio files to the output directory.
If the dataset is structured as
data/
└── LJSpeech-1.1
    ├── metadata.csv
    ├── README
    ├── test.txt
    ├── train.txt
    ├── val.txt
    └── wavs
Then you can extract the phoneme-level alignments from a trained Matcha-TTS model using:
python matcha/utils/get_durations_from_trained_model.py -i dataset_yaml -c <checkpoint>
Example:
python matcha/utils/get_durations_from_trained_model.py -i ljspeech.yaml -c matcha_ljspeech.ckpt
or simply:
matcha-tts-get-durations -i ljspeech.yaml -c matcha_ljspeech.ckpt
In the dataset config, turn on load_durations.
Example: ljspeech.yaml
load_durations: True
or see an example in configs/experiment/ljspeech_from_durations.yaml
If you use our code or otherwise find this work useful, please cite our paper:
@inproceedings{mehta2024matcha,
title={Matcha-{TTS}: A fast {TTS} architecture with conditional flow matching},
author={Mehta, Shivam and Tu, Ruibo and Beskow, Jonas and Sz{\'e}kely, {\'E}va and Henter, Gustav Eje},
booktitle={Proc. ICASSP},
year={2024}
}
Since this code uses Lightning-Hydra-Template, you have all the powers that come with it.
Other source code we would like to acknowledge: