EfficientSpeech: An On-Device Text to Speech Model

EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms at a speed of 104 (mRTF), that is, 104 seconds of speech per second, on an RPi4. Its tiny version has a footprint of only 266k parameters, about 1% of modern TTS models such as MixerTTS. Generating 6 seconds of speech consumes only 90 MFLOPS.
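To make the throughput figure concrete, the arithmetic below converts the quoted mRTF into a latency estimate. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it covers mel spectrogram generation only, as the quoted figure does.

```python
def mel_generation_seconds(speech_seconds: float, mrtf: float) -> float:
    """Compute time needed to generate mel frames for `speech_seconds` of audio.

    mRTF is seconds of speech produced per second of compute, so the
    compute time is the speech duration divided by mRTF.
    """
    if mrtf <= 0:
        raise ValueError("mrtf must be positive")
    return speech_seconds / mrtf


# Figures quoted above: mRTF of 104 on an RPi4, 6-second utterance.
latency = mel_generation_seconds(6.0, 104.0)
print(f"{latency * 1000:.1f} ms")  # 57.7 ms
```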

Paper

IEEE Xplore
arXiv

Model Architecture

EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is performed by a transposed depth-wise separable convolution.
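One reason a depth-wise separable layer suits a small model is its parameter count. The sketch below compares a standard 1-D transposed convolution with a depth-wise transposed convolution followed by a 1x1 point-wise convolution. The function names, the channel sizes, the kernel size, and the depth-wise-then-point-wise ordering are all assumptions made for illustration; they are not taken from the ES source code.

```python
def transposed_conv_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Weights in a standard 1-D transposed convolution."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_transposed_conv_params(c_in: int, c_out: int, kernel: int, bias: bool = True) -> int:
    """Depth-wise transposed conv (one filter per channel) plus a 1x1 point-wise conv."""
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer sizes, chosen only for illustration.
c_in, c_out, k = 128, 64, 5
standard = transposed_conv_params(c_in, c_out, k)
separable = separable_transposed_conv_params(c_in, c_out, k)
print(standard, separable)  # 41024 9024
```

With these illustrative sizes the separable variant needs roughly a fifth of the weights, and the saving grows with the kernel size.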

Quick Demo

Install

ES is currently migrating to PyTorch 2.0 and Lightning 2.0. Expect unstable features.

 pip install -r requirements.txt

If you encounter problems with cuBLAS:

 pip uninstall nvidia_cublas_cu11

Tiny ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav

The output file is written under outputs. Play the WAV file:

 ffplay outputs/fox.wav

After the weights have been downloaded, the local checkpoint can be reused:

 python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav

Playback:

 ffplay outputs/color.wav

Small ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
  --infer-device cpu --n-blocks 3 --reduction 2 \
  --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
  --wav-filename bees.wav

Playback:

 ffplay outputs/bees.wav

Base

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav

Playback:

 ffplay outputs/bees-base.wav
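The three variants differ only in their architecture flags, so the commands above can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository; the flags and values are copied from the demo commands above, and the tiny variant is assumed to rely on the script defaults because its command passes no architecture flags.

```python
import shlex

# Architecture flags copied from the demo commands above; tiny uses the defaults.
VARIANT_FLAGS = {
    "tiny": [],
    "small": ["--n-blocks", "3", "--reduction", "2"],
    "base": ["--head", "2", "--reduction", "1", "--expansion", "2",
             "--kernel-size", "5", "--n-blocks", "3", "--block-depth", "3"],
}


def demo_command(variant, checkpoint, text, wav_filename, device="cpu"):
    """Return the argv list for one demo.py run (hypothetical helper)."""
    if variant not in VARIANT_FLAGS:
        raise ValueError(f"unknown variant: {variant!r}")
    return [
        "python3", "demo.py",
        "--checkpoint", checkpoint,
        *VARIANT_FLAGS[variant],
        "--infer-device", device,
        "--text", text,
        "--wav-filename", wav_filename,
    ]


cmd = demo_command("small", "small_eng_952k.ckpt",
                   "Why do bees have sticky hair?", "bees.wav")
print(shlex.join(cmd))  # shell-quoted form, safe to paste into a terminal
```

The returned list can be passed directly to subprocess.run(cmd, check=True) from the repository root, which avoids shell quoting problems with long texts.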

GPU for Inference

This example also uses a long text. On an A100, this can reach RTF > 1,300. Time it using the --iter 100 option.

 python3 demo.py --checkpoint small_eng_952k.ckpt \
  --infer-device cuda --n-blocks 3 --reduction 2 \
  --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
  --wav-filename cats.wav --iter 100
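Averaging over many iterations smooths out warm-up and scheduling noise. The sketch below shows one way such a measurement could be structured. It is not the timing code used by demo.py; measure_rtf and fake_synthesize are hypothetical names, and the sleeping stand-in must be replaced by a real call into the model.

```python
import time


def measure_rtf(synthesize, audio_seconds, n_iter=100):
    """Average wall-clock cost of `synthesize()` over `n_iter` calls, as an RTF.

    RTF here means seconds of speech produced per second of compute,
    so larger values are faster.
    """
    if n_iter < 1:
        raise ValueError("n_iter must be at least 1")
    start = time.perf_counter()
    for _ in range(n_iter):
        synthesize()
    elapsed = time.perf_counter() - start
    return audio_seconds / (elapsed / n_iter)


def fake_synthesize():
    """Stand-in workload; replace with a real synthesis call."""
    time.sleep(0.001)


rtf = measure_rtf(fake_synthesize, audio_seconds=6.0, n_iter=20)
print(f"RTF ~ {rtf:.0f}")
```

On a GPU, kernel launches are asynchronous, so the callable should block until the result is ready (for example by copying the output to the CPU); otherwise the loop mostly measures launch time.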

Compile and Number of Threads Options

Compilation is supported by passing --compile during training or inference. For training, eager mode is faster; training the tiny version takes ~17 hours on an A100. For inference, the compiled version is faster. For an unknown reason, the compile option generates errors when used with --infer-device cuda.

By default, PyTorch 2.0 uses 128 CPU threads (on AMD; 4 on RPi4), which causes a slowdown during inference. During inference, it is recommended to set this to a lower number. For example: --threads 24.
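When calling the model from your own script rather than through demo.py, the same idea can be applied with torch.set_num_threads. The sketch below is an assumption about how one might do that, not a description of what --threads does internally. pick_thread_count is a hypothetical helper, and the torch import is optional so the snippet also runs where PyTorch is not installed.

```python
import os
from typing import Optional


def pick_thread_count(requested: int, available: Optional[int] = None) -> int:
    """Clamp a requested thread count to the range [1, available cores]."""
    if available is None:
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


threads = pick_thread_count(24)

try:
    import torch  # only needed to actually apply the setting
    torch.set_num_threads(threads)
except ImportError:
    pass  # PyTorch not installed; nothing to configure

print(threads)
```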

RPi4 Inference

PyTorch 2.0 is slower on the RPi4. Please use the Demo Release and the ICASSP2023 model weights.

RTF on PyTorch 2.0 is ~1.0. RTF on PyTorch 1.12 is ~1.7.

Alternatively, use the ONNX version:

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav

ONNX

The ONNX model only supports a fixed input phoneme length. Padding or truncation is applied if needed. Change the length using --onnx-insize=<desired value>. The default maximum phoneme length is 128. For example:

 python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
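Fitting a variable-length phoneme sequence into a fixed-size input can be sketched as follows. This is a minimal illustration, assuming right-padding and a pad id of 0; the actual padding side and pad token used by the ES scripts may differ, and pad_or_truncate is a hypothetical name.

```python
from typing import List


def pad_or_truncate(phoneme_ids: List[int], size: int = 128, pad_id: int = 0) -> List[int]:
    """Return exactly `size` ids: cut long inputs, right-pad short ones."""
    if size < 1:
        raise ValueError("size must be positive")
    clipped = phoneme_ids[:size]
    return clipped + [pad_id] * (size - len(clipped))


short = pad_or_truncate([5, 9, 2], size=8)
long = pad_or_truncate(list(range(1, 301)), size=256)
print(short)      # [5, 9, 2, 0, 0, 0, 0, 0]
print(len(long))  # 256
```

Truncation silently drops the tail of the utterance, so a larger input size is the safer choice for long texts.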

Dataset Preparation

Choose a dataset folder, e.g. <data_folder> = /data/tts, the directory where the dataset will be stored.

Download LJSpeech:

 cd <data_folder>
 wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
 tar xjvf LJSpeech-1.1.tar.bz2
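The archive is bzip2-compressed, so it must be opened with a bzip2-aware reader. For setups where the extraction step is scripted in Python, a minimal sketch using the standard library follows; extract_bz2_archive is a hypothetical helper, and it should only be used on archives from a trusted source.

```python
import tarfile
from pathlib import Path


def extract_bz2_archive(archive: Path, dest: Path) -> list:
    """Extract a .tar.bz2 archive into `dest` and return the member names."""
    dest.mkdir(parents=True, exist_ok=True)
    # "r:bz2" selects bzip2 decompression explicitly.
    with tarfile.open(archive, "r:bz2") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

For the example folder above, the call would be extract_bz2_archive(Path("/data/tts/LJSpeech-1.1.tar.bz2"), Path("/data/tts")).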

Prepare the dataset. <parent_folder> is where efficientspeech was git cloned.

 cd <parent_folder>/efficientspeech

Edit config/LJSpeech/preprocess.yaml:

 >>>>>>>>>>>>>>>>>
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
>>>>>>>>>>>>>>>>

Replace /data/tts with your <data_folder>.
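If the dataset lives elsewhere, the two absolute paths can be rewritten in one pass. The sketch below uses plain string replacement so that it needs no YAML library; retarget_data_folder and the /mnt/datasets folder are hypothetical, and the config text is the fragment shown above.

```python
def retarget_data_folder(yaml_text: str, data_folder: str, default: str = "/data/tts") -> str:
    """Replace the default dataset root in the config text with `data_folder`."""
    data_folder = data_folder.rstrip("/")
    # Match the trailing slash too, so only path prefixes are rewritten.
    return yaml_text.replace(default + "/", data_folder + "/")


config = '''path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
'''

updated = retarget_data_folder(config, "/mnt/datasets")
print(updated)
```

Relative entries such as lexicon_path and preprocessed_path are left untouched, since they are resolved against the repository folder.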

Download the alignment data to preprocessed_data/LJSpeech/TextGrid from here.

Prepare the dataset:

 python3 prepare_align.py config/LJSpeech/preprocess.yaml

This will take an hour or so.

For more information, see the FastSpeech2 implementation used to prepare the dataset.

Train

Tiny ES

By default:

--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
--accelerator=gpu
--infer-device=cuda
--devices=1
See more options in utils/tools.py

 python3 train.py

Small ES

 python3 train.py --n-blocks 3 --reduction 2

Base

 python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3

Comparison with other SOTA Neural TTS

ES vs FS2 vs PortaSpeech vs LightSpeech

Credits

The unofficial FastSpeech2 implementation on GitHub.

Citation

If you find this work useful, please cite:

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Additional information

Version: efficientspeech-0.2.1
Type: AI source code
Updated: 2025-08-21
Size: 4.85 MB
Source: GitHub
