ZeroVOX: Zero-shot real-time TTS system, fully offline, free and open source

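ZeroVOX is a text-to-speech (TTS) system built for real-time and embedded use.

ZeroVOX runs entirely offline, ensuring privacy and independence from cloud services. It is completely free and open source, and community contributions and suggestions are welcome.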

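ZeroVOX is modeled after FastSpeech2 and goes a step further with zero-shot speaker cloning, using global style tokens (GST) and speaker conditional layer normalization (SCLN) for effective speaker embedding. A single model, trained on an extensive dataset, supports both English and German speech generation. ZeroVOX is phoneme-based and relies on pronunciation dictionaries to ensure accurate word pronunciation: the CMU dictionary for English and a custom dictionary from the ZamiaSpeech project for German. The phoneme set also comes from ZamiaSpeech.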
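The speaker conditional layer normalization mentioned above can be illustrated with a minimal sketch. This is not ZeroVOX's code: it assumes the common SCLN formulation, in which a layer-normalized hidden vector is scaled and shifted by values projected from the speaker embedding. The names `linear` and `scln` and all weights are hypothetical, and plain Python lists stand in for tensors.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # one row per output feature


def linear(weights: Matrix, bias: Sequence[float], x: Sequence[float]) -> Vector:
    """Affine projection y = W x + b."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def scln(
    hidden: Sequence[float],
    speaker: Sequence[float],
    w_scale: Matrix,
    b_scale: Sequence[float],
    w_shift: Matrix,
    b_shift: Sequence[float],
    eps: float = 1e-5,
) -> Vector:
    """Layer-normalize `hidden`, then scale and shift it per speaker.

    Assumption: scale and shift are linear projections of the speaker
    embedding, replacing the fixed gain and bias of ordinary layer norm.
    """
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    scale = linear(w_scale, b_scale, speaker)
    shift = linear(w_shift, b_shift, speaker)
    return [g * n + b for g, n, b in zip(scale, normed, shift)]
```

With zero projection weights, a scale bias of one, and a shift bias of zero, `scln` reduces to ordinary layer normalization. Any non-zero weights make the output depend on the speaker embedding.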

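ZeroVOX can serve as a TTS backend for LLMs, enabling real-time interaction, and as an easy-to-install TTS system for home automation systems such as Home Assistant. Because it is non-autoregressive, like FastSpeech2, its output is generally easy to control and predictable.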

License: ZeroVOX is licensed under Apache 2. Many parts are taken from other projects under the MIT license (see the Credits section below).

Demo

Please note: the model is still in the alpha stage and still being trained.

https://huggingface.co/spaces/gooooofy/zerovox-demo

Audio Corpus Statistics

Current ZeroVOX training corpus statistics:

 german  audio corpus: 16679 speakers, 475.3 hours audio
 english audio corpus: 19899 speakers, 358.7 hours audio

ZeroVOX Model Training

Data Preparation

(1/5) Prepare the corpus:

 pushd configs/corpora/cv_de_100
 ./gen_cv.sh
 popd

(2/5) Prepare alignment:

 utils/prepare_align.py configs/corpora/cv_de_100

(3/5) Out-of-vocabulary (OOV) words:

 utils/oovtool.py -a -m zerovox-g2p-autoreg-zamia-de configs/corpora/cv_de_100

(4/5) Align:

 utils/align.py --kaldi-model=tts_de_kaldi_zamia_4 configs/corpora/cv_de_100

(5/5) Preprocess:

 utils/preprocess.py configs/corpora/cv_de_100
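The five steps above can be chained in a small driver script. The sketch below shows one way to automate them and is not part of ZeroVOX: `build_steps` and `run_data_prep` are hypothetical helpers that only wrap the commands shown above, and they assume the scripts are executable and invoked from the repository root.

```python
import subprocess
from typing import List, Optional, Tuple

# Each step is (argv, working directory); None means "current directory".
Step = Tuple[List[str], Optional[str]]


def build_steps(corpus_dir: str, g2p_model: str, kaldi_model: str) -> List[Step]:
    """Return the five data preparation commands in order."""
    return [
        # (1/5) gen_cv.sh runs inside the corpus directory (mirrors pushd/popd)
        (["./gen_cv.sh"], corpus_dir),
        # (2/5) prepare alignment
        (["utils/prepare_align.py", corpus_dir], None),
        # (3/5) handle out-of-vocabulary words with the G2P model
        (["utils/oovtool.py", "-a", "-m", g2p_model, corpus_dir], None),
        # (4/5) align using the Kaldi acoustic model
        (["utils/align.py", f"--kaldi-model={kaldi_model}", corpus_dir], None),
        # (5/5) preprocess
        (["utils/preprocess.py", corpus_dir], None),
    ]


def run_data_prep(
    corpus_dir: str, g2p_model: str, kaldi_model: str, dry_run: bool = False
) -> None:
    """Run all steps in order and stop at the first failing command."""
    steps = build_steps(corpus_dir, g2p_model, kaldi_model)
    for number, (argv, cwd) in enumerate(steps, start=1):
        print(f"({number}/{len(steps)}) {' '.join(argv)}")
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run(argv, cwd=cwd, check=True)


if __name__ == "__main__":
    # dry_run=True only prints the commands; set it to False to execute them.
    run_data_prep(
        "configs/corpora/cv_de_100",
        g2p_model="zerovox-g2p-autoreg-zamia-de",
        kaldi_model="tts_de_kaldi_zamia_4",
        dry_run=True,
    )
```

Stopping at the first failure matters here because each step consumes the output of the previous one.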

TTS Model Training

 utils/train_tts.py \
     --head=2 --reduction=1 --expansion=2 --kernel-size=5 --n-blocks=3 --block-depth=3 \
     --accelerator=gpu --threads=24 --batch-size=32 --val_epochs=8 \
     --infer-device=cpu \
     --lr=0.0001 --warmup_epochs=25 \
     --hifigan-checkpoint=VCTK_V2 \
     --out-folder=models/tts_de_zerovox_base_1 \
     configs/corpora/cv_de_100 \
     configs/corpora/de_hui/de_hui_*.yaml \
     configs/corpora/de_thorsten.yaml

Kaldi Acoustic Model Training

 utils/train_kaldi.py --model-name=tts_de_kaldi_zamia_4 --num-jobs=12 configs/corpora/cv_de_100

G2P Model Training

Run the training:

 scripts/train_g2p_de_autoreg.sh

Credits

Originally based on EfficientSpeech by Rowel Atienza

https://github.com/roatienza/efficientspeech

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

The FastSpeech2 encoder and decoder are borrowed (under the MIT license) from Chung-Ming Chien's implementation of FastSpeech2

https://github.com/ming024/fastspeech2

 @misc{ren2022fastspeech2fasthighquality,
    title={FastSpeech 2: Fast and High-Quality End-to-End Text to Speech}, 
    author={Yi Ren and Chenxu Hu and Xu Tan and Tao Qin and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},
    year={2022},
    eprint={2006.04558},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2006.04558}, 
}

The MEL decoder implementation is borrowed (under the MIT license) from Tomoki Hayashi's ParallelWaveGAN project:

https://github.com/kan-bayashi/parallelwavegan

The G2P transformer models are based on DeepPhonemizer by Axel Springer News Media & Tech GmbH & Co. KG - Ideas Engineering (MIT license)

https://github.com/as-ideas/deepphonemizer

 @inproceedings{Yolchuyeva_2019, series={interspeech_2019},
title={Transformer Based Grapheme-to-Phoneme Conversion},
url={http://dx.doi.org/10.21437/Interspeech.2019-1954},
DOI={10.21437/interspeech.2019-1954},
booktitle={Interspeech 2019},
publisher={ISCA},
author={Yolchuyeva, Sevinj and Németh, Géza and Gyires-Tóth, Bálint},
year={2019},
month=sep, pages={2095–2099},
collection={interspeech_2019} }

The zero-shot ResNet-based speaker encoder is borrowed (under the MIT license) from voxceleb_trainer by Clova AI Research

https://github.com/clovaai/voxceleb_trainer

 @inproceedings{chung2020in,
title={In defence of metric learning for speaker recognition},
author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
booktitle={Proc. Interspeech},
year={2020}
}

@inproceedings{he2016deep,
title={Deep residual learning for image recognition},
author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
pages={770--778},
year={2016}
}

The zero-shot global style token (GST) speaker embedding is based on GST-Tacotron by Chengqi Deng (MIT license)

https://github.com/kinglittleq/gst-tacotron

which is an implementation of

 @misc{wang2018style,
	  title={Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis},
	  author={Yuxuan Wang and Daisy Stanton and Yu Zhang and RJ Skerry-Ryan and Eric Battenberg and Joel Shor and Ying Xiao and Fei Ren and Ye Jia and Rif A. Saurous},
	  year={2018},
	  eprint={1803.09017},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL}
}

Speaker conditional layer normalization (SCLN) is borrowed (under the MIT license) from

https://github.com/keonlee9420/cross-speaker-emotion-transfer by Keon Lee

 @misc{wu2021crossspeakeremotiontransferbased,
    title={Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech}, 
    author={Pengfei Wu and Junjie Pan and Chenchang Xu and Junhui Zhang and Lin Wu and Xiang Yin and Zejun Ma},
    year={2021},
    eprint={2110.04153},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2110.04153}, 
}
