EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. On a Raspberry Pi 4 (RPi4), it generates mel spectrograms at a speed of 104 mRTF, i.e. 104 seconds of speech per second. Its tiny version has a footprint of just 266k parameters, only about 1% of a modern TTS model such as MixerTTS. Generating 6 seconds of speech consumes only 90 MFLOPS.
EfficientSpeech is a shallow (2 blocks!) U-Net-like pyramid transformer. Upsampling is done by transposed depth-wise separable convolution.
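As a rough illustration of that upsampling idea (a minimal sketch, not the repository's actual module; the class name and sizes below are made up), a transposed depth-wise separable convolution in PyTorch is a per-channel transposed convolution followed by a 1x1 point-wise convolution:

```python
import torch
import torch.nn as nn

class TransposedDepthwiseSeparableUpsample(nn.Module):
    """Illustrative only: 2x temporal upsampling via a depth-wise
    transposed convolution (groups = channels) followed by a 1x1
    point-wise convolution that mixes information across channels."""
    def __init__(self, in_channels, out_channels, kernel_size=3):
        super().__init__()
        # per-channel transposed conv doubles the time dimension
        self.depthwise = nn.ConvTranspose1d(
            in_channels, in_channels, kernel_size, stride=2,
            padding=kernel_size // 2, output_padding=1, groups=in_channels)
        # point-wise conv mixes channels (and changes the channel count)
        self.pointwise = nn.Conv1d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        # x: (batch, channels, time) -> (batch, out_channels, 2 * time)
        return self.pointwise(self.depthwise(x))

x = torch.randn(1, 128, 50)                           # dummy feature sequence
y = TransposedDepthwiseSeparableUpsample(128, 80)(x)  # e.g. project to 80 mel bins
print(y.shape)                                        # torch.Size([1, 80, 100])
```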
Install
ES is currently being migrated to PyTorch 2.0 and Lightning 2.0. Expect unstable features.
pip install -r requirements.txt
If you encounter problems with cuBLAS:
pip uninstall nvidia_cublas_cu11
Tiny ES
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav
The output file is saved under outputs. Play the wav file:
ffplay outputs/fox.wav
Once the checkpoint has been downloaded, it can be reused:
python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav
Play it:
ffplay outputs/color.wav
Small ES
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
  --infer-device cpu --n-blocks 3 --reduction 2 \
  --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
  --wav-filename bees.wav
Play it:
ffplay outputs/bees.wav
Base ES
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav
Play it:
ffplay outputs/bees-base.wav
Inference on a GPU

With a long text, an RTF of > 1,300 can be reached on an A100. Time it using the --iter 100 option (a rough RTF computation is sketched after the command below).
python3 demo.py --checkpoint small_eng_952k.ckpt \
  --infer-device cuda --n-blocks 3 --reduction 2 \
  --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
  --wav-filename cats.wav --iter 100
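For reference, the real-time factor reported above is simply the duration of the generated audio divided by the wall-clock time spent generating it. A minimal sketch (this is not demo.py's own timing code; `synthesize` is a hypothetical callable standing in for the model):

```python
import time

SAMPLE_RATE = 22050  # sampling rate of LJSpeech

def real_time_factor(synthesize, text, iters=100):
    """Average real-time factor (RTF) over several runs.
    `synthesize` is a hypothetical callable that returns a 1-D array
    of audio samples for the given text (stand-in for the model)."""
    start = time.perf_counter()
    for _ in range(iters):
        wav = synthesize(text)
    elapsed = (time.perf_counter() - start) / iters
    audio_secs = len(wav) / SAMPLE_RATE
    # RTF > 1 means faster than real time, e.g. 104 means
    # 104 seconds of speech generated per second of compute
    return audio_secs / elapsed
```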
The --compile option is supported to enable compilation during training or inference. For training, eager mode is faster; training the small version on an A100 takes about 17 hours. For inference, the compiled version is faster. For unknown reasons, the compile option generates errors when --infer-device cuda is used.
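For context, --compile presumably wraps the model with torch.compile, the standard PyTorch 2.0 pattern. A self-contained sketch with a stand-in module (not the actual EfficientSpeech network):

```python
import torch
import torch.nn as nn

# stand-in module; in the repo, this would be the EfficientSpeech model
model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 80))

# torch.compile returns an optimized wrapper around the module; the first
# call is slow because the graph is traced and compiled, while later calls
# with the same input shapes reuse the compiled graph
compiled = torch.compile(model)

x = torch.randn(8, 64)
with torch.no_grad():
    y = compiled(x)
print(y.shape)  # torch.Size([8, 80])
```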
By default, PyTorch 2.0 uses all available CPU threads (e.g. 128 on an AMD machine, 4 on the RPi4), which slows down inference. It is recommended to set this to a lower number during inference, for example: --threads 24.
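The --threads option presumably maps to PyTorch's own thread controls; the equivalent calls inside a script would look like this (values are illustrative):

```python
import torch

# limit intra-op parallelism (the thread pool used by most CPU kernels)
torch.set_num_threads(24)

# optionally limit inter-op parallelism too; this must be called once,
# early in the script, before any parallel work has started
torch.set_num_interop_threads(4)

print(torch.get_num_threads(), torch.get_num_interop_threads())
```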
PyTorch 2.0 is slower on the RPi4. Please use the demo version and the ICASSP 2023 model weights.
The RTF on PyTorch 2.0 is ~1.0, versus ~1.7 on PyTorch 1.12.
Alternatively, use the ONNX version:
python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav
Only a fixed input phoneme length is supported; the input will be padded or truncated as needed. Modify it with --onnx-insize=<desired value>. The default maximum phoneme length is 128. For example (a sketch of running the exported model follows the conversion command below):
python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
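A minimal onnxruntime sketch of the fixed-length behaviour described above. The input tensor name, dtype, and phoneme-ID preprocessing are assumptions for illustration, not the repository's documented interface:

```python
import numpy as np
import onnxruntime as ort

MAX_LEN = 128  # must match the --onnx-insize used during conversion

session = ort.InferenceSession("tiny_eng_266k.onnx",
                               providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name  # assumed: a single phoneme-ID input

def pad_or_truncate(phoneme_ids, max_len=MAX_LEN):
    """Zero-pad or truncate to the fixed length the exported graph expects."""
    ids = np.zeros((1, max_len), dtype=np.int64)  # dtype is an assumption
    ids[0, :min(len(phoneme_ids), max_len)] = phoneme_ids[:max_len]
    return ids

# phoneme_ids would come from the repo's text-to-phoneme front end (not shown)
phoneme_ids = [12, 43, 7, 19, 55]
outputs = session.run(None, {input_name: pad_or_truncate(phoneme_ids)})
print([o.shape for o in outputs])
```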
Dataset Preparation

Choose a dataset folder, for example <data_folder> = /data/tts, the directory where the dataset will be stored.
Download LJSpeech:
cd <data_folder>
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar jxvf LJSpeech-1.1.tar.bz2
Prepare the dataset. <parent_folder> is the location where efficientspeech was git-cloned.
cd <parent_folder>/efficientspeech
Edit config/LJSpeech/preprocess.yaml:
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
Replace /data/tts with your <data_folder>.
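Optionally, a small helper (not part of the repository) can confirm that the edited paths exist before running the preparation scripts:

```python
from pathlib import Path
import yaml  # pip install pyyaml

with open("config/LJSpeech/preprocess.yaml") as f:
    cfg = yaml.safe_load(f)

# check the directories that must point at the downloaded LJSpeech data
for key in ("corpus_path", "raw_path"):
    p = Path(cfg["path"][key])
    print(f"{key}: {p} -> {'ok' if p.exists() else 'MISSING'}")
```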
Download the alignment data from here and place it under preprocessed_data/LJSpeech/TextGrid.
Prepare the dataset:
python3 prepare_align.py config/LJSpeech/preprocess.yaml
This will take an hour or so.
For more information, see the FastSpeech2 implementation's instructions for preparing the dataset.
Train

Tiny ES
By default:
--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64
--accelerator=gpu
--infer-device=cuda
--devices=1
See more options in utils/tools.py.

python3 train.py
Small ES
python3 train.py --n-blocks 3 --reduction 2
Base ES
python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3
ES vs FS2 vs PortaSpeech vs LightSpeech
If you find this work useful, please cite:
@inproceedings{atienza2023efficientspeech,
title={EfficientSpeech: An On-Device Text to Speech Model},
author={Atienza, Rowel},
booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2023},
organization={IEEE}
}