ZeroVOX: a zero-shot, real-time TTS system, fully offline, free and open source

ZeroVOX is a text-to-speech (TTS) system built for real-time and embedded use.

ZeroVOX runs entirely offline, which ensures privacy and independence from cloud services. It is completely free and open source, and community contributions and suggestions are welcome.

Modeled after FastSpeech2, ZeroVOX goes a step further with zero-shot speaker cloning, using global style tokens (GST) and speaker conditional layer normalization (SCLN) for effective speaker embedding. The system supports both English and German speech generation from a single model trained on an extensive dataset. ZeroVOX is phoneme-based and relies on pronunciation dictionaries to ensure accurate word articulation: the CMU dictionary for English and a custom dictionary for German from the ZamiaSpeech project, which is also the origin of the phoneme set used.
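To make the speaker conditioning more concrete, here is a minimal pure-Python sketch of speaker conditional layer normalization as described in the SCLN paper cited in the credits: a hidden vector is normalized, then scaled and shifted by values projected from the speaker embedding. The function names, the toy dimensions and the weight matrices are illustrative assumptions and are not taken from the ZeroVOX code.

```python
import math

def linear(weights, bias, vec):
    """Apply y = W @ vec + b using plain lists (W is a list of rows)."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def scln(hidden, speaker_emb, w_scale, b_scale, w_shift, b_shift, eps=1e-5):
    """Normalize `hidden`, then scale/shift it with speaker-dependent values."""
    mean = sum(hidden) / len(hidden)
    var = sum((h - mean) ** 2 for h in hidden) / len(hidden)
    normed = [(h - mean) / math.sqrt(var + eps) for h in hidden]
    scale = linear(w_scale, b_scale, speaker_emb)   # gamma(speaker)
    shift = linear(w_shift, b_shift, speaker_emb)   # beta(speaker)
    return [g * n + b for g, n, b in zip(scale, normed, shift)]

# Hypothetical toy sizes: hidden size 4, speaker embedding size 2.
hidden = [1.0, 2.0, 3.0, 4.0]
speaker = [0.5, -0.5]
zeros = [[0.0, 0.0] for _ in range(4)]
# With zero projection weights, scale is 1 and shift is 0: plain layer norm.
out = scln(hidden, speaker, zeros, [1.0] * 4, zeros, [0.0] * 4)
```

Because the scale and shift depend only on the speaker embedding, the same text encoding can be steered toward a different voice by swapping the embedding, which is what makes this kind of conditioning suitable for zero-shot cloning.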

ZeroVOX can serve as a TTS backend for LLMs, enabling real-time interactions, and as an easy-to-install TTS system for home automation systems such as Home Assistant. Since it is non-autoregressive, like FastSpeech2, its output is generally easy to control and predictable.
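One reason non-autoregressive models in the FastSpeech2 family are easy to control is that phoneme durations are predicted explicitly and can be rescaled before synthesis. The following is a minimal sketch of that idea, a length regulator that repeats each phoneme according to its duration. The function name, the `speed` parameter and the example values are assumptions for illustration, not the ZeroVOX API.

```python
def length_regulate(phonemes, durations, speed=1.0):
    """Repeat each phoneme for its (speed-scaled) number of frames."""
    if len(phonemes) != len(durations):
        raise ValueError("need one duration per phoneme")
    frames = []
    for ph, dur in zip(phonemes, durations):
        # Faster speech (speed > 1) means fewer frames per phoneme.
        n = max(1, round(dur / speed))
        frames.extend([ph] * n)
    return frames

phonemes = ["HH", "AH", "L", "OW"]
durations = [4, 6, 4, 8]          # frames per phoneme, hypothetical values
normal = length_regulate(phonemes, durations)
fast = length_regulate(phonemes, durations, speed=2.0)
```

The output length is fully determined by the durations, so halving them halves the number of frames; an autoregressive decoder offers no such direct handle.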

License: ZeroVOX is Apache 2 licensed, with many parts taken from other projects (see the credits section below) under the MIT license.

Demo

Please note: the model is still in alpha stage and still training.

https://huggingface.co/spaces/gooofy/zerovox-demo

Audio Corpus Stats

Current ZeroVOX training corpus stats:

 german  audio corpus: 16679 speakers, 475.3 hours audio
english audio corpus: 19899 speakers, 358.7 hours audio
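As a quick sanity check on these figures, the short sketch below derives the total corpus size and the average amount of audio per speaker, which works out to between one and two minutes per speaker for both languages. The numbers are copied from the statistics above; the variable and function names are illustrative.

```python
# Figures copied from the corpus statistics above.
corpora = {
    "german": {"speakers": 16679, "hours": 475.3},
    "english": {"speakers": 19899, "hours": 358.7},
}

def seconds_per_speaker(stats):
    """Average amount of audio per speaker, in seconds."""
    return stats["hours"] * 3600 / stats["speakers"]

total_hours = sum(c["hours"] for c in corpora.values())
# Speaker counts are simply summed per corpus.
total_speakers = sum(c["speakers"] for c in corpora.values())
per_speaker = {name: seconds_per_speaker(c) for name, c in corpora.items()}

for name, secs in per_speaker.items():
    print(f"{name}: about {secs:.0f} s of audio per speaker")
print(f"total: {total_hours:.1f} hours from {total_speakers} speakers")
```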

ZeroVOX Model Training

Data Preparation

(1/5) prepare corpus yamls:

 pushd configs/corpora/cv_de_100
./gen_cv.sh
popd

(2/5) prepare alignment:

 utils/prepare_align.py configs/corpora/cv_de_100

(3/5) OOVs (out-of-vocabulary words):

 utils/oovtool.py -a -m zerovox-g2p-autoreg-zamia-de configs/corpora/cv_de_100

(4/5) align:

 utils/align.py --kaldi-model=tts_de_kaldi_zamia_4 configs/corpora/cv_de_100

(5/5) preprocess:

 utils/preprocess.py configs/corpora/cv_de_100
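The five preparation steps above can also be chained from a small driver script. The sketch below only reuses the commands and arguments shown in the steps; the helper names `build_steps` and `run_steps` and the `dry_run` switch are hypothetical and not part of the ZeroVOX repository. It assumes it is started from the repository root.

```python
import subprocess

def build_steps(corpus_dir, g2p_model, kaldi_model):
    """Return (working_dir, argv) pairs for the five preparation steps."""
    return [
        (corpus_dir, ["./gen_cv.sh"]),
        (None, ["utils/prepare_align.py", corpus_dir]),
        (None, ["utils/oovtool.py", "-a", "-m", g2p_model, corpus_dir]),
        (None, ["utils/align.py", f"--kaldi-model={kaldi_model}", corpus_dir]),
        (None, ["utils/preprocess.py", corpus_dir]),
    ]

def run_steps(steps, dry_run=True):
    """Print each step; execute it only when dry_run is False."""
    for cwd, argv in steps:
        print(f"[{cwd or '.'}] {' '.join(argv)}")
        if not dry_run:
            # check=True stops the pipeline at the first failing step.
            subprocess.run(argv, cwd=cwd, check=True)

steps = build_steps(
    "configs/corpora/cv_de_100",
    g2p_model="zerovox-g2p-autoreg-zamia-de",
    kaldi_model="tts_de_kaldi_zamia_4",
)
run_steps(steps)  # dry run: only prints the commands
```

Calling `run_steps(steps, dry_run=False)` would execute the steps in order and abort on the first non-zero exit code, which matters here because each step consumes the output of the previous one.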

TTS Model Training

 utils/train_tts.py \
    --head=2 --reduction=1 --expansion=2 --kernel-size=5 --n-blocks=3 --block-depth=3 \
    --accelerator=gpu --threads=24 --batch-size=32 --val_epochs=8 \
    --infer-device=cpu \
    --lr=0.0001 --warmup_epochs=25 \
    --hifigan-checkpoint=VCTK_V2 \
    --out-folder=models/tts_de_zerovox_base_1 \
    configs/corpora/cv_de_100 \
    configs/corpora/de_hui/de_hui_*.yaml \
    configs/corpora/de_thorsten.yaml
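The command above sets a base learning rate (`--lr=0.0001`) together with a number of warmup epochs (`--warmup_epochs=25`). How ZeroVOX shapes the learning rate during warmup is not stated here. Purely as an assumption, the sketch below shows one common choice, a linear ramp up to the base rate; the function name and the linear shape are illustrative and may differ from what `train_tts.py` actually does.

```python
def warmup_lr(epoch, base_lr=0.0001, warmup_epochs=25):
    """Linear warmup (an assumed schedule): ramp up to base_lr, then hold."""
    if warmup_epochs <= 0 or epoch >= warmup_epochs:
        return base_lr
    return base_lr * (epoch + 1) / warmup_epochs

# Learning rate for the first 30 epochs under this assumed schedule.
schedule = [warmup_lr(e) for e in range(30)]
```

Starting with a small learning rate and increasing it gradually is a widely used way to keep early updates stable before the optimizer statistics have settled.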

Kaldi Acoustic Model Training

 utils/train_kaldi.py --model-name=tts_de_kaldi_zamia_4 --num-jobs=12 configs/corpora/cv_de_100

G2P Model Training

run training:

 scripts/train_g2p_de_autoreg.sh

Credits

Originally based on EfficientSpeech by Rowel Atienza

https://github.com/roatienza/efficientspeech

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

The FastSpeech2 encoder and decoder are borrowed (under MIT license) from Chung-Ming Chien's implementation of FastSpeech2

https://github.com/ming024/fastSpeech2

 @misc{ren2022fastspeech2fasthighquality,
    title={FastSpeech 2: Fast and High-Quality End-to-End Text to Speech}, 
    author={Yi Ren and Chenxu Hu and Xu Tan and Tao Qin and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},
    year={2022},
    eprint={2006.04558},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2006.04558}, 
}

The MEL decoder implementation is borrowed (under MIT license) from Tomoki Hayashi's ParallelWaveGAN project:

https://github.com/kan-bayashi/parallelwavegan

The G2P transformer models are based on DeepPhonemizer by Axel Springer News Media & Tech GmbH & Co. KG - Ideas Engineering (MIT license)

https://github.com/as-ideas/DeepPhonemizer

 @inproceedings{Yolchuyeva_2019, series={interspeech_2019},
title={Transformer Based Grapheme-to-Phoneme Conversion},
url={http://dx.doi.org/10.21437/Interspeech.2019-1954},
DOI={10.21437/interspeech.2019-1954},
booktitle={Interspeech 2019},
publisher={ISCA},
author={Yolchuyeva, Sevinj and Németh, Géza and Gyires-Tóth, Bálint},
year={2019},
month=sep, pages={2095–2099},
collection={interspeech_2019} }

The zero-shot ResNet-based speaker encoder is borrowed (under MIT license) from voxceleb_trainer by Clova AI Research

https://github.com/clovaai/voxceleb_trainer

 @inproceedings{chung2020in,
title={In defence of metric learning for speaker recognition},
author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
booktitle={Proc. Interspeech},
year={2020}
}

@inproceedings{he2016deep,
title={Deep residual learning for image recognition},
author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
pages={770--778},
year={2016}
}

The zero-shot global style token (GST) based speaker embedding is based on GST-Tacotron by Chengqi Deng (MIT license)

https://github.com/kinglittleq/gst-tacotron

which is an implementation of

 @misc{wang2018style,
	  title={Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis},
	  author={Yuxuan Wang and Daisy Stanton and Yu Zhang and RJ Skerry-Ryan and Eric Battenberg and Joel Shor and Ying Xiao and Fei Ren and Ye Jia and Rif A. Saurous},
	  year={2018},
	  eprint={1803.09017},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL}
}

The speaker conditional layer normalization (SCLN) is borrowed (under MIT license) from

https://github.com/keonlee9420/Cross-Speaker-Emotion-Transfer by Keon Lee

 @misc{wu2021crossspeakeremotiontransferbased,
    title={Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech}, 
    author={Pengfei Wu and Junjie Pan and Chenchang Xu and Junhui Zhang and Lin Wu and Xiang Yin and Zejun Ma},
    year={2021},
    eprint={2110.04153},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2110.04153}, 
}
