EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms at an mRTF of 104, i.e. 104 secs of speech per sec, on an RPi4. Its tiny version has a footprint of just 266k parameters, only about 1% of a modern-day TTS model such as MixerTTS. Generating 6 secs of speech consumes only 90 MFLOPS.
EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is done by a transposed depth-wise separable convolution.
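As a rough sketch of that upsampling idea (not the repository's actual module; the channel count, kernel width, and stride below are made-up values), a transposed depth-wise separable convolution in PyTorch looks like this:

```python
import torch
import torch.nn as nn

class DepthwiseSeparableUpsample(nn.Module):
    """Illustrative 2x temporal upsampler: depth-wise transposed conv
    followed by a point-wise (1x1) conv. Sizes are illustrative only."""

    def __init__(self, channels=128, kernel_size=4, stride=2):
        super().__init__()
        # Depth-wise: one transposed filter per channel (groups=channels).
        self.depthwise = nn.ConvTranspose1d(
            channels, channels, kernel_size, stride=stride,
            padding=(kernel_size - stride) // 2, groups=channels)
        # Point-wise: 1x1 conv mixes information across channels.
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):  # x: (batch, channels, frames)
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 128, 50)
print(DepthwiseSeparableUpsample()(x).shape)  # torch.Size([1, 128, 100])
```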
Install
ES is currently migrating to PyTorch 2.0 and Lightning 2.0. Expect unstable features.
```
pip install -r requirements.txt
```
The compiled model is supported via the --compile option during training or inference. For training, eager mode is faster; training the tiny version takes ~17 hrs on an A100. For inference, the compiled version is faster. For an unknown reason, the --compile option generates errors when --infer-device is cuda.
By default, PyTorch 2.0 uses many CPU threads (128 on the AMD test machine, 4 on the RPi4), which causes a slowdown during inference. It is recommended to set a lower thread count during inference, for example --threads 24.
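These flags presumably map to standard PyTorch 2.0 calls; a minimal sketch, assuming a model object is already loaded (the flag-to-call mapping is an inference from the notes above, not traced from the repo's code):

```python
import torch

# Likely effect of --threads 24: cap the CPU thread pool.
torch.set_num_threads(24)
print(torch.get_num_threads())

# Likely effect of --compile (PyTorch 2.0):
# model = torch.compile(model)  # 'model' is a hypothetical loaded EfficientSpeech
```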
PyTorch 2.0 is slower on the RPi4; please use the Demo Release and the ICASSP 2023 model weights instead. RTF, measured here as secs of speech generated per sec of wall-clock time (so higher is faster), is ~1.0 on PyTorch 2.0 versus ~1.7 on PyTorch 1.12.
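A hypothetical way to measure the RTF figures above (synthesize and sample_rate are placeholders, not the repository's API):

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """Secs of audio generated per sec of wall-clock time (higher = faster)."""
    start = time.perf_counter()
    wav = synthesize(text)  # placeholder TTS callable returning 1-D samples
    elapsed = time.perf_counter() - start
    return len(wav) / sample_rate / elapsed
```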
Alternatively, please use the ONNX version:
```
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav
```
The ONNX model supports only a fixed input phoneme length; padding or truncation is applied as needed. Modify the size at conversion time using --onnx-insize=<desired value>. The default max phoneme length is 128. For example:
```
python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
```
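A minimal onnxruntime sketch of that fixed-length contract; the input layout, pad value, and dtype are assumptions here, so inspect session.get_inputs()/get_outputs() on the actual file:

```python
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("tiny_eng_266k.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name  # inspect rather than hard-code

max_len = 128  # must match the --onnx-insize used at conversion
ids = np.zeros((1, max_len), dtype=np.int64)  # zero padding and int64 assumed
# ids[0, :len(phonemes)] = phonemes[:max_len]  # pad/truncate real phoneme ids

outputs = session.run(None, {input_name: ids})
```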
Choose a dataset folder, e.g. <data_folder> = /data/tts, the directory where the dataset will be stored.
Create a folder for the Custom KSS Dataset:

```
cd efficientspeech
mkdir -p ./data/kss
```

Download the Custom KSS Dataset here.
Prepare the dataset. <parent_folder> is the directory where efficientspeech was git-cloned.

```
cd <parent_folder>/efficientspeech
```
Edit config/kss/preprocess.yaml:
```yaml
path:
  corpus_path: "./data/tts/kss"
  lexicon_path: "lexicon/korean-lexicon.txt"
  raw_path: "./data/tts/kss/wavs"
  preprocessed_path: "./preprocessed_data/kss"
```
Replace ./data/tts with your <data_folder>.
Download alignment data to preprocessed_data/KSS/TextGrid from here.
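As a quick hypothetical sanity check (not a script from the repo), confirm that the alignment files landed where preprocessing will look for them:

```python
from pathlib import Path

# Path as given above; match the case used in your preprocess.yaml.
textgrids = list(Path("preprocessed_data/KSS/TextGrid").rglob("*.TextGrid"))
print(f"found {len(textgrids)} TextGrid files")
```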
Prepare the dataset:

```
python prepare_align.py config/kss/preprocess.yaml
python preprocess.py config/kss/preprocess.yaml
```
This will take an hour or so.
For more info, see the FastSpeech2 implementation's instructions for preparing the dataset.
Tiny ES
By default:
- --precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
- --accelerator=gpu
- --infer-device=cuda
- --devices=1

Other default settings are in utils/tools.py. Train the tiny version with:

```
python3 train.py
```
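For example, assuming the flags behave as listed above, a run that swaps the fp16 default for bfloat16 mixed precision would be:

```
python3 train.py --precision bf16-mixed
```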
Small ES
```
python3 train.py --n-blocks 3 --reduction 2
```
Base ES
```
python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3
```
```
python3 demo.py --checkpoint ./lightning_logs/version_2/checkpoints/epoch=4999-step=485000.ckpt \
  --text "그는 괜찮은 척하려고 애 쓰는 것 같았다." --wav-filename base.wav
```

(The Korean text means: "He seemed to be trying hard to pretend to be okay.")
ES vs FS2 vs PortaSpeech vs LightSpeech
For more information, please refer to the FastSpeech2 repository mentioned above.

Done: synthesize.py, Korean text2phoneme function ✅

If you find this work useful, please cite:
```bibtex
@inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
```