EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms on an RPi4 at a speed of 104 mRTF, or 104 seconds of speech per second. Its tiny version has a footprint of just 266k parameters - only about 1% of modern TTS models such as MixerTTS. Generating 6 seconds of speech consumes only 90 MFLOPS.
EfficientSpeech is a shallow (just 2 blocks!) U-Net-like pyramid transformer. Upsampling is done by a transposed depth-wise separable convolution.
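As a rough PyTorch sketch of what "upsampling by a transposed depth-wise separable convolution" means (the class name and channel sizes below are illustrative only and are not taken from the EfficientSpeech code):

import torch
from torch import nn

class DepthwiseSeparableUpsample(nn.Module):
    # Depth-wise transposed conv (groups == channels) doubles the time resolution,
    # then a point-wise 1x1 conv mixes the channels.
    def __init__(self, in_channels=128, out_channels=128):
        super().__init__()
        self.depthwise = nn.ConvTranspose1d(
            in_channels, in_channels, kernel_size=4, stride=2, padding=1,
            groups=in_channels)          # per-channel 2x upsampling
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):                # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 128, 50)              # e.g. 50 feature frames
y = DepthwiseSeparableUpsample()(x)
print(y.shape)                            # torch.Size([1, 128, 100])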
Install
ES is currently being migrated to PyTorch 2.0 and Lightning 2.0. Expect unstable features.
pip install -r requirements.txt
If you encounter problems with cuBLAS:
pip uninstall nvidia_cublas_cu11
Tiny ES
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav
The output file is stored under outputs. To play the wav file:
ffplay outputs/fox.wav
Once the weights have been downloaded, they can be reused:
python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav
To play:
ffplay outputs/color.wav
Small ES
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
  --infer-device cpu --n-blocks 3 --reduction 2 \
  --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
  --wav-filename bees.wav
To play:
ffplay outputs/bees.wav
Base ES
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav
To play:
ffplay outputs/bees-base.wav
GPU Inference
Use a GPU for inference with long text. An RTF of more than 1,300 can be reached on an A100. Timing is done with the --iter 100 option.
python3 demo.py --checkpoint small_eng_952k.ckpt \
  --infer-device cuda --n-blocks 3 --reduction 2 \
  --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
  --wav-filename cats.wav --iter 100
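For reference, RTF here means seconds of speech produced per second of wall-clock time, so the --iter option simply repeats synthesis to obtain a stable timing. A minimal sketch of such a measurement, where synthesize() is a placeholder for whatever function returns the waveform (not the actual demo.py internals):

import time

def measure_rtf(synthesize, text, iters=100, sample_rate=22050):
    # synthesize(text) is assumed to return the waveform as a 1-D array of samples
    start = time.time()
    for _ in range(iters):
        wav = synthesize(text)
    elapsed = time.time() - start
    audio_seconds = iters * len(wav) / sample_rate
    return audio_seconds / elapsed       # RTF: seconds of audio per second of compute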
The --compile option enables compilation during training or inference. For training, eager mode is faster; training the tiny version takes about 17 hours on an A100. For inference, the compiled version is faster. For unknown reasons, the compile option produces errors when --infer-device cuda is used.
By default, PyTorch 2.0 uses all available CPU threads (e.g. 128 on an AMD server CPU, 4 on an RPi4), which slows things down during inference. It is recommended to set this to a lower number during inference, for example: --threads 24.
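The --threads option presumably maps onto PyTorch's CPU thread setting; the same limit can be applied directly in a script:

import torch

torch.set_num_threads(24)          # cap intra-op CPU threads during inference
print(torch.get_num_threads())     # verify the setting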
PyTorch 2.0 is slower on the RPi4. Please use the demo version and the ICASSP 2023 model weights instead.
The RTF on PyTorch 2.0 is ~1.0, compared to ~1.7 on PyTorch 1.12.
Alternatively, use the ONNX version:
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav
Only a fixed input phoneme length is supported; the input will be padded or truncated as needed. Modify it with --onnx-insize=<desired value>. The default maximum phoneme length is 128. For example:
python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
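A minimal sketch of running the exported model with onnxruntime while padding the phoneme ids to the fixed length. The input tensor name, dtype, and shape below are assumptions; inspect the exported model with session.get_inputs() rather than relying on them:

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("tiny_eng_266k.onnx")
input_name = session.get_inputs()[0].name       # check the model's real input signature
max_len = 128                                   # must match the --onnx-insize used at export

phoneme_ids = np.array([12, 45, 7, 33], dtype=np.int64)        # placeholder phoneme ids
padded = np.zeros(max_len, dtype=np.int64)
padded[:min(len(phoneme_ids), max_len)] = phoneme_ids[:max_len]  # pad or truncate

outputs = session.run(None, {input_name: padded[None, :]})       # add a batch dimension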
Train

Choose a dataset folder, e.g. <data_folder> = /data/tts - the directory where the dataset will be stored.
Download LJSpeech:
cd <data_folder>
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar jxvf LJSpeech-1.1.tar.bz2
Prepare the dataset. <parent_folder> is the location of the git-cloned efficientspeech.
cd <parent_folder>/efficientspeech
Edit config/LJSpeech/preprocess.yaml:
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
Replace /data/tts with your <data_folder>.
Download the alignment data from here into preprocessed_data/LJSpeech/TextGrid.
Prepare the dataset:
python3 prepare_align.py config/LJSpeech/preprocess.yaml
This will take an hour or so.
For more information, refer to the FastSpeech2 implementation on how the dataset is prepared.
Tiny ES
By default:
--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
--accelerator=gpu
--infer-device=cuda
--devices=1
See utils/tools.py for more options.

python3 train.py
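For example, the listed defaults can be written out explicitly on the command line (assuming the flags are passed exactly as shown above):

python3 train.py --precision=16 --accelerator=gpu --devices=1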
Small ES
python3 train.py --n-blocks 3 --reduction 2
Base ES
python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3
ES vs FS2 vs PortaSpeech vs LightSpeech
If you find this work useful, please cite:
@inproceedings{atienza2023efficientspeech,
title={EfficientSpeech: An On-Device Text to Speech Model},
author={Atienza, Rowel},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}