ZeroVOX: Zero-Shot Realtime TTS System, Fully Offline, Free and Open Source

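ZeroVOX is a text-to-speech (TTS) system built for real-time and embedded use.

ZeroVOX runs entirely offline, which ensures privacy and independence from cloud services. It is completely free and open source, and community contributions and suggestions are welcome.

Modeled after FastSpeech2, ZeroVOX goes a step further with zero-shot speaker cloning, using global style tokens (GST) and speaker conditional layer normalization (SCLN) for effective speaker embedding. The system supports both English and German speech generation from a single model trained on an extensive dataset. ZeroVOX is phoneme-based and relies on pronunciation dictionaries to ensure accurate word articulation: the CMU dictionary for English and a custom German dictionary from the ZamiaSpeech project, which is also the source of the phoneme set used.

ZeroVOX can serve as a TTS backend for LLMs, enabling real-time interaction, and as an easy-to-install TTS system for home automation systems such as Home Assistant. Because it is non-autoregressive, like FastSpeech2, its output is generally easy to control and predictable.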
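The speaker conditional layer normalization (SCLN) mentioned above can be illustrated with a minimal pure-Python sketch. It follows the general idea of the technique (normalize the hidden features, then apply a scale and shift predicted from the speaker embedding by linear layers); it is not taken from the ZeroVOX code. All function names, dimensions, and numbers below are hypothetical.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (roughly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(weights, bias, vec):
    """Apply y = W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def scln(x, speaker_emb, w_scale, b_scale, w_shift, b_shift):
    """Layer norm whose scale and shift are predicted from the speaker embedding."""
    scale = linear(w_scale, b_scale, speaker_emb)
    shift = linear(w_shift, b_shift, speaker_emb)
    return [g * h + b for g, h, b in zip(scale, layer_norm(x), shift)]

# Toy example: 4 hidden features, 2-dimensional speaker embedding (made-up numbers).
hidden = [1.0, 2.0, 3.0, 4.0]
speaker = [0.5, -1.0]
w_scale = [[0.0, 0.0]] * 4   # zero weights plus unit bias: scale is 1 everywhere
b_scale = [1.0] * 4
w_shift = [[0.0, 0.0]] * 4   # zero weights and zero bias: shift is 0 everywhere
b_shift = [0.0] * 4

# With these neutral parameters SCLN reduces to plain layer normalization.
out = scln(hidden, speaker, w_scale, b_scale, w_shift, b_shift)

# Non-zero shift weights make the output depend on the speaker embedding.
shifted = scln(hidden, speaker, w_scale, b_scale, [[1.0, 0.0]] * 4, b_shift)
```

In a trained model the scale and shift weights are learned, so each speaker embedding modulates the normalized features differently.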

License: ZeroVOX is licensed under the MIT license and uses many parts from other projects (see the Credits section below).

Demo

Note: the model is still in the alpha stage and is still being trained.

https://huggingface.co/spaces/goooofy/zerovox-demo

Audio Corpus Stats

Current ZeroVOX training corpus stats:

 german  audio corpus: 16679 speakers, 475.3 hours audio
english audio corpus: 19899 speakers, 358.7 hours audio
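As a rough reading of these figures, the average amount of audio per speaker can be derived from the two lines above. The sketch below parses the stats text and computes minutes per speaker; the regular expression and function name are illustrative only and are not part of ZeroVOX.

```python
import re

STATS = """\
german  audio corpus: 16679 speakers, 475.3 hours audio
english audio corpus: 19899 speakers, 358.7 hours audio
"""

LINE_RE = re.compile(r"(\w+)\s+audio corpus:\s+(\d+) speakers,\s+([\d.]+) hours")

def minutes_per_speaker(text):
    """Return {language: average minutes of audio per speaker}."""
    result = {}
    for match in LINE_RE.finditer(text):
        lang = match.group(1)
        speakers = int(match.group(2))
        hours = float(match.group(3))
        result[lang] = hours * 60 / speakers
    return result

averages = minutes_per_speaker(STATS)
for lang, minutes in averages.items():
    print(f"{lang}: {minutes:.2f} min/speaker")
```

Both averages come out below two minutes per speaker, so the corpus covers many voices with little audio from each.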

ZeroVOX Model Training

Data Preparation

(1/5) Prepare corpus YAMLs:

 pushd configs/corpora/cv_de_100
./gen_cv.sh
popd

(2/5) Prepare alignment:

 utils/prepare_align.py configs/corpora/cv_de_100

(3/5) OOVs:

 utils/oovtool.py -a -m zerovox-g2p-autoreg-zamia-de configs/corpora/cv_de_100

(4/5) Align:

 utils/align.py --kaldi-model=tts_de_kaldi_zamia_4 configs/corpora/cv_de_100

(5/5) Preprocess:

 utils/preprocess.py configs/corpora/cv_de_100
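Step 3 above deals with out-of-vocabulary (OOV) words, that is, words in the transcripts that have no entry in the pronunciation dictionary. The sketch below shows only that general idea and is not the logic of utils/oovtool.py. The toy lexicon uses CMU-dictionary-style entries for illustration, and the function name is hypothetical. The step 3 command names a G2P model, which suggests that missing pronunciations are generated by that model, but that reading is an assumption.

```python
import re

# A word is a run of letters, optionally followed by an apostrophe part ("don't").
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)?")

def find_oovs(transcripts, lexicon):
    """Return the sorted list of words that have no entry in the lexicon."""
    oovs = set()
    for line in transcripts:
        for word in WORD_RE.findall(line.lower()):
            if word not in lexicon:
                oovs.add(word)
    return sorted(oovs)

# Toy lexicon (illustrative entries only).
lexicon = {
    "hello": ["HH", "AH0", "L", "OW1"],
    "world": ["W", "ER1", "L", "D"],
}
transcripts = ["Hello world", "Hello ZeroVOX"]

oovs = find_oovs(transcripts, lexicon)
print(oovs)  # ['zerovox']
```

Words reported this way need a pronunciation before alignment can map them to phonemes.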

TTS Model Training

 utils/train_tts.py \
    --head=2 --reduction=1 --expansion=2 --kernel-size=5 --n-blocks=3 --block-depth=3 \
    --accelerator=gpu --threads=24 --batch-size=32 --val_epochs=8 \
    --infer-device=cpu \
    --lr=0.0001 --warmup_epochs=25 \
    --hifigan-checkpoint=VCTK_V2 \
    --out-folder=models/tts_de_zerovox_base_1 \
    configs/corpora/cv_de_100 \
    configs/corpora/de_hui/de_hui_*.yaml \
    configs/corpora/de_thorsten.yaml
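The command sets --lr=0.0001 and --warmup_epochs=25. How utils/train_tts.py uses these values is not described here. One common interpretation of such a pair is a linear warmup, where the learning rate ramps up to its target over the first epochs and then stays there. The sketch below shows that interpretation only; the function is hypothetical and the actual schedule may differ (for example, it may decay after the warmup).

```python
def warmup_lr(epoch, base_lr=1e-4, warmup_epochs=25):
    """Linearly ramp the learning rate up to base_lr, then hold it constant.

    Assumption: epochs are counted from 0 and the ramp ends at base_lr.
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    return base_lr

# Learning rate for the first 30 epochs under this assumed schedule.
schedule = [warmup_lr(epoch) for epoch in range(30)]
for epoch in (0, 12, 24, 29):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.2e}")
```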

Kaldi Acoustic Model Training

 utils/train_kaldi.py --model-name=tts_de_kaldi_zamia_4 --num-jobs=12 configs/corpora/cv_de_100

G2P Model Training

Run training:

 scripts/train_g2p_de_autoreg.sh

Credits

Originally based on EfficientSpeech by Rowel Atienza

https://github.com/roatienza/efficientspeech

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

The FastSpeech2 encoder and decoder are borrowed (under MIT license) from Chung-Ming Chien's FastSpeech2 implementation

https://github.com/ming024/fastspeech2

 @misc{ren2022fastspeech2fasthighquality,
    title={FastSpeech 2: Fast and High-Quality End-to-End Text to Speech}, 
    author={Yi Ren and Chenxu Hu and Xu Tan and Tao Qin and Sheng Zhao and Zhou Zhao and Tie-Yan Liu},
    year={2022},
    eprint={2006.04558},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2006.04558}, 
}

The mel decoder implementation is borrowed (under MIT license) from Tomoki Hayashi's ParallelWaveGAN project

https://github.com/kan-bayashi/parallelwavegan

The G2P transformer models are based on DeepPhonemizer by Axel Springer News Media & Tech GmbH & Co. KG - Ideas Engineering (MIT license)

https://github.com/as-ideas/deepphonemizer

 @inproceedings{Yolchuyeva_2019, series={interspeech_2019},
title={Transformer Based Grapheme-to-Phoneme Conversion},
url={http://dx.doi.org/10.21437/Interspeech.2019-1954},
DOI={10.21437/interspeech.2019-1954},
booktitle={Interspeech 2019},
publisher={ISCA},
author={Yolchuyeva, Sevinj and Németh, Géza and Gyires-Tóth, Bálint},
year={2019},
month=sep, pages={2095–2099},
collection={interspeech_2019} }

The zero-shot ResNet-based speaker encoder is borrowed (under MIT license) from voxceleb_trainer by Clova AI Research

https://github.com/clovaai/voxceleb_trainer

 @inproceedings{chung2020in,
title={In defence of metric learning for speaker recognition},
author={Chung, Joon Son and Huh, Jaesung and Mun, Seongkyu and Lee, Minjae and Heo, Hee Soo and Choe, Soyeon and Ham, Chiheon and Jung, Sunghwan and Lee, Bong-Jin and Han, Icksang},
booktitle={Proc. Interspeech},
year={2020}
}

@inproceedings{he2016deep,
title={Deep residual learning for image recognition},
author={He, Kaiming and Zhang, Xiangyu and Ren, Shaoqing and Sun, Jian},
booktitle={IEEE Conference on Computer Vision and Pattern Recognition},
pages={770--778},
year={2016}
}

The zero-shot global style token based speaker embedding is based on GST-Tacotron by Chengqi Deng (MIT license)

https://github.com/kinglittleq/gst-tacotron

which is an implementation of

 @misc{wang2018style,
	  title={Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis},
	  author={Yuxuan Wang and Daisy Stanton and Yu Zhang and RJ Skerry-Ryan and Eric Battenberg and Joel Shor and Ying Xiao and Fei Ren and Ye Jia and Rif A. Saurous},
	  year={2018},
	  eprint={1803.09017},
	  archivePrefix={arXiv},
	  primaryClass={cs.CL}
}

Speaker conditional layer normalization (SCLN) is borrowed (under MIT license) from Keon Lee's implementation

https://github.com/keonlee9420/cross-peaker-emotion-transfer

 @misc{wu2021crossspeakeremotiontransferbased,
    title={Cross-speaker Emotion Transfer Based on Speaker Condition Layer Normalization and Semi-Supervised Training in Text-To-Speech}, 
    author={Pengfei Wu and Junjie Pan and Chenchang Xu and Junhui Zhang and Lin Wu and Xiang Yin and Zejun Ma},
    year={2021},
    eprint={2110.04153},
    archivePrefix={arXiv},
    primaryClass={eess.AS},
    url={https://arxiv.org/abs/2110.04153}, 
}
