efficientspeech

efficientspeech-0.2.1

EfficientSpeech: An On-Device Text to Speech Model

EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms at a speed of 104 (mRTF), that is, 104 seconds of speech per second, on an RPi4. Its tiny version has a footprint of only 266k parameters, about 1% of a modern TTS model such as MixerTTS. Generating 6 seconds of speech consumes only 90 MFLOPS.
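
To make the mRTF figure concrete, here is a minimal sketch of the arithmetic, assuming mRTF means seconds of mel spectrogram produced per second of compute, as stated above. The function names are illustrative and are not part of the EfficientSpeech code base.

```python
def real_time_factor(audio_seconds: float, compute_seconds: float) -> float:
    """Seconds of speech generated per second of compute."""
    if compute_seconds <= 0:
        raise ValueError("compute_seconds must be positive")
    return audio_seconds / compute_seconds


def compute_time(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to generate `audio_seconds` of speech at a given RTF."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


# 104 seconds of speech generated in 1 second gives an mRTF of 104.
print(real_time_factor(104.0, 1.0))

# At mRTF 104, a 6-second mel spectrogram takes roughly 58 ms to generate.
print(round(compute_time(6.0, 104.0) * 1000), "ms")
```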

Paper

IEEE Xplore
arXiv

Model Architecture

EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is done by a transposed depth-wise separable convolution.
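
A depth-wise separable convolution replaces one dense convolution with a per-channel (depth-wise) filter followed by a 1x1 (point-wise) channel mixer, which is where most of the parameter savings come from. The sketch below only counts weights for a 1D layer. The channel counts and kernel size are made-up values for illustration, not the ones used in EfficientSpeech, and bias terms are ignored.

```python
def dense_conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a dense 1D convolution (a transposed one has the same count)."""
    return in_channels * out_channels * kernel_size


def separable_conv1d_params(in_channels: int, out_channels: int, kernel_size: int) -> int:
    """Weights in a depth-wise filter plus a 1x1 point-wise mixer."""
    depthwise = in_channels * kernel_size   # one small filter per input channel
    pointwise = in_channels * out_channels  # 1x1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only to show the ratio.
in_channels, out_channels, kernel_size = 128, 64, 3
dense = dense_conv1d_params(in_channels, out_channels, kernel_size)
separable = separable_conv1d_params(in_channels, out_channels, kernel_size)

print(dense)      # 24576
print(separable)  # 8576
print(f"{dense / separable:.1f}x fewer weights")
```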

Quick Demo

Install

ES is currently migrating to PyTorch 2.0 and Lightning 2.0. Expect unstable features.

 pip install -r requirements.txt

If you run into problems with cuBLAS:

 pip uninstall nvidia_cublas_cu11

Tiny ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav

The output file is written under outputs. Play the WAV file:

 ffplay outputs/fox.wav
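
If no audio player is at hand, the generated file can also be inspected from Python. This is a minimal sketch that assumes demo.py writes a standard PCM WAV file, which is the only kind the standard-library wave module can read; the helper name is illustrative.

```python
import os
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


# Only inspect the demo output if it has already been generated.
if os.path.exists("outputs/fox.wav"):
    print(f"{wav_duration_seconds('outputs/fox.wav'):.2f} s")
```

Dividing this duration by the measured synthesis time gives the real-time factor for that utterance.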

Once downloaded, the weights can be reused:

 python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav

Playback:

 ffplay outputs/color.wav

Small ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
  --infer-device cpu --n-blocks 3 --reduction 2 \
  --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
  --wav-filename bees.wav

Playback:

 ffplay outputs/bees.wav

Base ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav

Playback:

 ffplay outputs/bees-base.wav

GPU for Inference

Inference also runs on a GPU, shown here with a long text. On an A100, this can reach RTF > 1,300. Time it with the --iter 100 option.

 python3 demo.py --checkpoint small_eng_952k.ckpt \
  --infer-device cuda --n-blocks 3 --reduction 2 \
  --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
  --wav-filename cats.wav --iter 100
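
Averaging over many iterations smooths out one-off costs such as warm-up. How demo.py implements --iter internally is not described here, so the following is only a generic sketch of the idea; average_rtf and fake_synthesize are hypothetical names, and the sleep call stands in for a real model.

```python
import time
from typing import Callable


def average_rtf(synthesize: Callable[[], None], audio_seconds: float,
                iterations: int = 100,
                clock: Callable[[], float] = time.perf_counter) -> float:
    """Run `synthesize` repeatedly and return speech seconds per mean run second."""
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    start = clock()
    for _ in range(iterations):
        synthesize()
    mean_seconds = (clock() - start) / iterations
    return audio_seconds / mean_seconds


def fake_synthesize() -> None:
    """Stand-in for a model call that produces one second of speech."""
    time.sleep(0.001)


print(f"RTF ~ {average_rtf(fake_synthesize, audio_seconds=1.0, iterations=10):.0f}")
```

On a GPU, the timed callable would also need to wait for the device to finish its work before the clock is read, otherwise the measured time is too short.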

Compile and Number of Threads Options

Compilation is supported with --compile during training or inference. For training, eager mode is faster. Training the tiny version takes ~17 hours on an A100. For inference, the compiled version is faster. For an unknown reason, the compile option generates errors when used with --infer-device cuda.

By default, PyTorch 2.0 uses 128 CPU threads (AMD; 4 on an RPi4), which slows down inference. During inference, it is recommended to set this to a lower number. For example: --threads 24.
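
The same cap can be applied from Python. This sketch assumes that --threads ultimately limits PyTorch's intra-op thread pool, which is what torch.set_num_threads controls; that mapping is an assumption, and both helper names are illustrative. The torch import is optional so the thread-count logic can be tried without PyTorch installed.

```python
import os
from typing import Optional


def pick_thread_count(requested: int = 24, available: Optional[int] = None) -> int:
    """Clamp the requested thread count to the range [1, available cores]."""
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def apply_thread_limit(requested: int = 24) -> int:
    """Apply the clamped count to PyTorch if it is installed, and return it."""
    count = pick_thread_count(requested)
    try:
        import torch
    except ImportError:
        return count
    torch.set_num_threads(count)
    return count


print(pick_thread_count(24, available=128))  # 24
print(pick_thread_count(24, available=4))    # 4
```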

RPi4 Inference

PyTorch 2.0 is slower on the RPi4. Please use the Demo Release and the ICASSP2023 model weights.

RTF on PyTorch 2.0 is ~1.0. RTF on PyTorch 1.12 is ~1.7.

Alternatively, use the ONNX version:

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav

ONNX

Only a fixed input phoneme length is supported. Padding or truncation is applied as needed. Change the length with --onnx-insize=<desired value>. The default maximum phoneme length is 128. For example:

 python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
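
The padding and truncation step can be pictured as follows. This is a minimal sketch: padding on the right with id 0 is an assumption made for illustration, and fit_to_length is not a function from the repository.

```python
from typing import List


def fit_to_length(phoneme_ids: List[int], target_len: int = 128, pad_id: int = 0) -> List[int]:
    """Truncate or right-pad a phoneme id sequence to exactly `target_len` items."""
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    if len(phoneme_ids) >= target_len:
        return phoneme_ids[:target_len]
    return phoneme_ids + [pad_id] * (target_len - len(phoneme_ids))


print(fit_to_length([5, 9, 2], target_len=5))           # [5, 9, 2, 0, 0]
print(fit_to_length(list(range(1, 8)), target_len=5))   # [1, 2, 3, 4, 5]
```

Text longer than the fixed size loses its tail, so a larger --onnx-insize is the safer choice for long sentences.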

Dataset Preparation

Choose a dataset folder, e.g. <data_folder> = /data/tts, the directory where the dataset will be stored.

Download LJSpeech:

 cd <data_folder>
 wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
 tar xjvf LJSpeech-1.1.tar.bz2

Prepare the dataset: <parent_folder> is where efficientspeech was git cloned.

 cd <parent_folder>/efficientspeech

Edit config/LJSpeech/preprocess.yaml:

 >>>>>>>>>>>>>>>>>
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
>>>>>>>>>>>>>>>>

Replace /data/tts with your <data_folder>.
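
The substitution only touches the entries that live under the dataset folder. A minimal sketch of that rule, using the four path values from the YAML fragment; retarget_paths is a hypothetical helper and /mnt/speech is a made-up example folder.

```python
from pathlib import PurePosixPath
from typing import Dict

DEFAULT_ROOT = "/data/tts"


def retarget_paths(paths: Dict[str, str], data_folder: str,
                   default_root: str = DEFAULT_ROOT) -> Dict[str, str]:
    """Move every path under `default_root` to `data_folder`; keep the rest as-is."""
    updated = {}
    for key, value in paths.items():
        try:
            relative = PurePosixPath(value).relative_to(default_root)
        except ValueError:
            # Not under the default root (e.g. repo-relative paths): leave untouched.
            updated[key] = value
            continue
        updated[key] = str(PurePosixPath(data_folder) / relative)
    return updated


paths = {
    "corpus_path": "/data/tts/LJSpeech-1.1",
    "lexicon_path": "lexicon/librispeech-lexicon.txt",
    "raw_path": "/data/tts/LJSpeech-1.1/wavs",
    "preprocessed_path": "./preprocessed_data/LJSpeech",
}
for key, value in retarget_paths(paths, "/mnt/speech").items():
    print(f"{key}: {value}")
```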

Download the alignment data to preprocessed_data/LJSpeech/TextGrid from here.

Prepare the dataset:

 python3 prepare_align.py config/LJSpeech/preprocess.yaml

This will take about an hour.

For more information, see the FastSpeech2 implementation's instructions for preparing the dataset.

Train

Tiny ES

By default:

--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
--accelerator=gpu
--infer-device=cuda
--devices=1
See more options in utils/tools.py

 python3 train.py

Small ES

 python3 train.py --n-blocks 3 --reduction 2

Base ES

 python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3
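
The three variants differ only in the architecture flags passed on the command line, and the same flags must be repeated at inference time with demo.py. The sketch below collects the flags listed in this document into one table; build_command is a hypothetical convenience helper, not part of the repository.

```python
from typing import Dict, List, Sequence

# Architecture flags per variant, copied from the commands in this document.
VARIANT_FLAGS: Dict[str, List[str]] = {
    "tiny": [],
    "small": ["--n-blocks", "3", "--reduction", "2"],
    "base": ["--head", "2", "--reduction", "1", "--expansion", "2",
             "--kernel-size", "5", "--n-blocks", "3", "--block-depth", "3"],
}


def build_command(script: str, variant: str, extra: Sequence[str] = ()) -> List[str]:
    """Return the argument list for running `script` with a given ES variant."""
    if variant not in VARIANT_FLAGS:
        raise ValueError(f"unknown variant: {variant!r}")
    return ["python3", script, *VARIANT_FLAGS[variant], *extra]


print(" ".join(build_command("train.py", "small")))
print(" ".join(build_command("demo.py", "base", ["--infer-device", "cpu"])))
```

Returning a list rather than a string means the result can be passed directly to subprocess.run without shell quoting issues.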

Comparison with Other SOTA Neural TTS

ES vs FS2 vs PortaSpeech vs LightSpeech

Credits

Unofficial FastSpeech2 GitHub implementation.

Citation

If you find this work useful, please cite:

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Additional Information

Version efficientspeech-0.2.1
Type AI Source Code
Updated 2025-08-21
Size 4.85MB
Source GitHub
