This is the repository for our paper, Nix-TTS (accepted to IEEE SLT 2022). We release the pretrained models, an interactive demo, and audio samples below.
[[Paper Link](Coming Soon!)] [Interactive Demo] [Audio Samples]
Abstract: Several solutions for lightweight TTS have shown promising results. Still, they either rely on a hand-crafted design that reaches a non-optimal size, or use neural architecture search but often suffer from high training costs. We present Nix-TTS, a lightweight TTS achieved via knowledge distillation from a high-quality yet large-sized, non-autoregressive, and end-to-end (vocoder-free) TTS teacher model. Specifically, we offer module-wise distillation, enabling flexible and independent distillation of the encoder and decoder modules. The resulting Nix-TTS inherits the advantageous properties of being non-autoregressive and end-to-end from the teacher, yet is significantly smaller in size, with only 5.23M parameters, or up to an 89.34% reduction from the teacher model. It also achieves over 3.04$\times$ and 8.36$\times$ inference speedup on an Intel-i7 CPU and a Raspberry Pi 3B, respectively, while still retaining fair voice naturalness and intelligibility compared to the teacher model.
Clone the nix-tts repository and move to its directory
git clone https://github.com/rendchevi/nix-tts.git
cd nix-tts
Install the dependencies (Python >= 3.8 is required)
pip install -r requirements.txt
sudo apt-get install espeak
Or follow the official instructions if the command above does not work.
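Before running the model, it can help to confirm that the two prerequisites named above are in place. The following is a minimal sketch, not part of the repository: `check_environment` is a hypothetical helper, and it assumes the espeak executable is on `PATH` under the name `espeak` (the name of the apt package above).

```python
import shutil
import sys

# Minimum Python version stated in the installation notes.
MIN_PYTHON = (3, 8)


def check_environment(min_python=MIN_PYTHON, espeak_binary="espeak"):
    """Return a list of human-readable problems; an empty list means all good.

    `espeak_binary` is an assumption: the executable name may differ
    on some systems.
    """
    problems = []
    if sys.version_info < min_python:
        found = ".".join(str(part) for part in sys.version_info[:3])
        wanted = ".".join(str(part) for part in min_python)
        problems.append(f"Python {wanted}+ is required, found {found}")
    if shutil.which(espeak_binary) is None:
        problems.append(f"'{espeak_binary}' was not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    for issue in issues:
        print("Problem:", issue)
    if not issues:
        print("Environment looks ready.")
```

This only checks that the interpreter version and the executable are present; it does not verify the packages listed in `requirements.txt`.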
Download your chosen pre-trained model here.
| Model | Num. of Params | Faster than real-time* (CPU Intel-i7) | Faster than real-time* (RasPi Model 3B) |
|---|---|---|---|
| Nix-TTS (ONNX) | 5.23 M | 11.9x | 0.50x |
| Nix-TTS w/ Stochastic Duration (ONNX) | 6.03 M | 10.8x | 0.50x |
* Here, we compute how much faster than real time the model runs as the inverse of the Real Time Factor (RTF). The complete table of speedups for all models is given in the paper.
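The footnote above can be made concrete with a small sketch. It assumes the usual definition of RTF as synthesis time divided by the duration of the generated audio; the function names, and the `synthesize` callable passed to `measure_speedup`, are hypothetical and not part of the Nix-TTS API. The paper's own measurement setup may differ.

```python
import time

# Sampling rate used in this README's usage example (Audio(..., rate = 22050)).
SAMPLE_RATE = 22050


def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    if synthesis_seconds <= 0 or audio_seconds <= 0:
        raise ValueError("both durations must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_real_time(synthesis_seconds, audio_seconds):
    """'Faster than real-time' factor, computed as the inverse of RTF."""
    return 1.0 / real_time_factor(synthesis_seconds, audio_seconds)


def measure_speedup(synthesize, sample_rate=SAMPLE_RATE):
    """Time one call to `synthesize` (a hypothetical zero-argument callable
    returning a 1-D sequence of audio samples) and return the speedup."""
    start = time.perf_counter()
    samples = synthesize()
    elapsed = time.perf_counter() - start
    audio_seconds = len(samples) / sample_rate
    return speedup_over_real_time(elapsed, audio_seconds)
```

Under this definition, a value of 0.50x in the table corresponds to an RTF of 2, that is, about two seconds of compute per second of generated audio. A single timed call is noisy, so averaging over several utterances would give a more stable figure.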
And running Nix-TTS is as easy as:
from nix.models.TTS import NixTTSInference
from IPython.display import Audio
# Initialize Nix-TTS
nix = NixTTSInference(model_dir = "<path_to_the_downloaded_model>")
# Tokenize input text
c, c_length, phoneme = nix.tokenize("Born to multiply, born to gaze into night skies.")
# Convert text to raw speech
xw = nix.vocalize(c, c_length)
# Listen to the generated speech
Audio(xw[0,0], rate = 22050)
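Outside a notebook, `IPython.display.Audio` is not available for playback, so one option is to write the waveform to a WAV file instead. The sketch below uses only the Python standard library. `save_wav` is a hypothetical helper, and it assumes `xw[0,0]` is a one-dimensional sequence of float samples roughly in the range [-1, 1] at 22050 Hz, which the example above suggests but does not state.

```python
import array
import sys
import wave


def save_wav(samples, path, rate=22050):
    """Write mono float samples as a 16-bit PCM WAV file.

    Assumes samples lie roughly in [-1, 1]; values outside that range
    are clipped. Returns the number of frames written.
    """
    pcm = array.array("h")
    for sample in samples:
        clipped = max(-1.0, min(1.0, float(sample)))
        pcm.append(int(clipped * 32767))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(rate)
        wav_file.writeframes(pcm.tobytes())
    return len(pcm)


# Hypothetical usage with the output of nix.vocalize(...) from above:
# save_wav(xw[0, 0], "nix_tts_sample.wav")
```

Clipping before the integer conversion avoids overflow errors if the model output slightly exceeds the assumed range.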