efficientspeech

efficientspeech-0.2.1

EfficientSpeech: An On-Device Text to Speech Model

EfficientSpeech, or ES for short, is an efficient neural text to speech (TTS) model. It generates mel spectrograms at a speed of 104 (mRTF), that is, 104 seconds of speech per second, on an RPi4. Its tiny version has a footprint of just 266k parameters, only about 1% of a modern TTS model such as MixerTTS. Generating 6 seconds of speech consumes only 90 MFLOPs.
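
The headline figures can be cross-checked with a little arithmetic. The sketch below assumes that mRTF means seconds of mel spectrogram produced per second of compute, as the paragraph above describes it; the function names are illustrative and are not part of the repository.

```python
def synthesis_time(speech_seconds: float, rtf: float) -> float:
    """Seconds of compute needed to generate `speech_seconds` of speech."""
    return speech_seconds / rtf


def flops_per_speech_second(total_flops: float, speech_seconds: float) -> float:
    """Average FLOPs spent per second of generated speech."""
    return total_flops / speech_seconds


if __name__ == "__main__":
    # Figures quoted above: mRTF of 104 on an RPi4, 90 MFLOPs for 6 s of speech.
    print(f"6 s of speech at mRTF 104: {synthesis_time(6.0, 104.0) * 1000:.1f} ms")
    print(f"FLOPs per second of speech: {flops_per_speech_second(90e6, 6.0):.0f}")
```

Under that reading, 6 seconds of speech needs roughly 58 ms of compute on the RPi4, and the model spends about 15 MFLOPs per second of generated speech.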

Paper

IEEE Xplore
arXiv

Model Architecture

EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is done by a transposed depth-wise separable convolution.
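
To see why a depth-wise separable transposed convolution keeps upsampling cheap, it helps to count weights. The sketch below compares a standard 1-D transposed convolution with a depth-wise transposed convolution followed by a 1x1 point-wise convolution. The 1-D setting, the channel counts, and the kernel size are assumptions chosen for illustration, not the model's actual configuration, and biases are ignored.

```python
def standard_transposed_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a dense transposed convolution: every input channel
    connects to every output channel through its own kernel."""
    return c_in * c_out * kernel


def separable_transposed_conv_params(c_in: int, c_out: int, kernel: int) -> int:
    """Weights in a depth-wise transposed conv plus a 1x1 point-wise conv."""
    depthwise = c_in * kernel  # one kernel per input channel (groups == c_in)
    pointwise = c_in * c_out   # 1x1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    c_in, c_out, kernel = 128, 64, 3  # illustrative sizes only
    dense = standard_transposed_conv_params(c_in, c_out, kernel)
    separable = separable_transposed_conv_params(c_in, c_out, kernel)
    print(f"dense: {dense}, separable: {separable}, ratio: {dense / separable:.2f}")
```

The saving grows with the kernel size, because the dense layer multiplies the kernel length by both channel counts while the separable layer multiplies it by the input channels only.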

Quick Demo

Install

ES is currently migrating to PyTorch 2.0 and Lightning 2.0. Expect unstable features.

 pip install -r requirements.txt

If you encounter problems with cuBLAS:

 pip uninstall nvidia_cublas_cu11

Tiny ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav

The output file is written to outputs. Play the WAV file:

 ffplay outputs/fox.wav

Once the weights have been downloaded, the local checkpoint can be reused:

 python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav

Playback:

 ffplay outputs/color.wav

Small ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
  --infer-device cpu --n-blocks 3 --reduction 2 \
  --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
  --wav-filename bees.wav

Playback:

 ffplay outputs/bees.wav

Base ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav

Playback:

 ffplay outputs/bees-base.wav
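
The three variants differ only in the architecture flags passed to demo.py, so the commands can be assembled programmatically. The helper below is a hypothetical convenience wrapper, not part of the repository; the flag values are copied from the commands in this section, and `VARIANT_FLAGS` and `demo_command` are illustrative names.

```python
import shlex

# Architecture flags copied from the demo commands; tiny uses the defaults.
VARIANT_FLAGS = {
    "tiny": [],
    "small": ["--n-blocks", "3", "--reduction", "2"],
    "base": ["--head", "2", "--reduction", "1", "--expansion", "2",
             "--kernel-size", "5", "--n-blocks", "3", "--block-depth", "3"],
}


def demo_command(variant, checkpoint, text, wav_filename, device="cpu"):
    """Return the demo.py invocation for one model variant as an argv list."""
    if variant not in VARIANT_FLAGS:
        raise ValueError(f"unknown variant: {variant!r}")
    return ["python3", "demo.py", "--checkpoint", checkpoint,
            *VARIANT_FLAGS[variant],
            "--infer-device", device,
            "--text", text,
            "--wav-filename", wav_filename]


if __name__ == "__main__":
    cmd = demo_command("small", "small_eng_952k.ckpt",
                       "Why do bees have sticky hair?", "bees.wav")
    print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Passing the returned list to `subprocess.run(cmd, check=True)` from the repository root would run the demo without any shell quoting concerns.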

GPU for Inference

The same works on a GPU and with a long text. On an A100, this can reach RTF > 1,300. Time it using the --iter 100 option.

 python3 demo.py --checkpoint small_eng_952k.ckpt \
  --infer-device cuda --n-blocks 3 --reduction 2 \
  --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
  --wav-filename cats.wav --iter 100
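
How demo.py measures time internally is not shown here. The following is only a generic sketch of the idea behind timing over many iterations: run the synthesis repeatedly, average the wall-clock time, and divide the audio duration by it. `synthesize` is a hypothetical stand-in for a call into the model, and `measure_rtf` is an illustrative name.

```python
import time


def measure_rtf(synthesize, speech_seconds: float, iters: int = 100) -> float:
    """Average the wall-clock time of `synthesize()` over `iters` runs
    and return seconds of speech produced per second of compute."""
    synthesize()  # warm-up run, excluded from the measurement
    start = time.perf_counter()
    for _ in range(iters):
        synthesize()
    per_run = (time.perf_counter() - start) / iters
    return speech_seconds / max(per_run, 1e-12)  # guard against a zero reading


if __name__ == "__main__":
    # Stand-in workload: pretend each call takes 5 ms and yields 6 s of audio.
    rtf = measure_rtf(lambda: time.sleep(0.005), speech_seconds=6.0, iters=20)
    print(f"RTF ~ {rtf:.0f}")
```

Averaging over many runs matters most on a GPU, where a single short call is dominated by launch overhead and timer noise.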

Compile and Number of Threads Options

Compilation is supported with --compile during training or inference. For training, eager mode is faster; training the tiny version takes ~17 hours on an A100. For inference, the compiled version is faster. For an unknown reason, the compile option generates errors when used with --infer-device cuda.

By default, PyTorch 2.0 uses 128 CPU threads (AMD; 4 on an RPi4), which causes a slowdown during inference. During inference, it is recommended to set this to a lower number. For example: --threads 24.
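
One way to pick a value is to cap the thread count at the number of cores the machine actually has. The helper below is a hypothetical sketch: the name `pick_threads` is illustrative, and the cap of 24 is simply the example value mentioned above, not a tuned recommendation.

```python
import os


def pick_threads(cap: int = 24) -> int:
    """Return a thread count no larger than `cap` or the visible CPU count."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, available))


if __name__ == "__main__":
    # Print the flag in the form used by the commands in this document.
    print(f"--threads {pick_threads()}")
```

In plain PyTorch code, `torch.set_num_threads` sets the intra-op thread count; whether demo.py uses it to implement --threads is not stated here.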

RPi4 Inference

PyTorch 2.0 is slower on the RPi4. Use the Demo Release and the ICASSP2023 model weights instead.

RTF on PyTorch 2.0 is ~1.0. RTF on PyTorch 1.12 is ~1.7.

Alternatively, use the ONNX version:

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav

ONNX

The ONNX version only supports a fixed input phoneme length. Padding or truncation is applied if needed. Change the length using --onnx-insize=<desired value>. The default maximum phoneme length is 128. For example:

 python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
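
The padding and truncation step can be pictured as follows. This is a minimal sketch rather than the repository's code: `pad_or_truncate` is an illustrative name, and using 0 as the padding id is an assumption.

```python
from typing import List


def pad_or_truncate(phoneme_ids: List[int], size: int = 128,
                    pad_id: int = 0) -> List[int]:
    """Force a phoneme id sequence to exactly `size` entries."""
    if len(phoneme_ids) >= size:
        return phoneme_ids[:size]  # drop everything past the limit
    # pad_id = 0 is an assumed padding symbol, not taken from the repository
    return phoneme_ids + [pad_id] * (size - len(phoneme_ids))


if __name__ == "__main__":
    print(pad_or_truncate([5, 9, 2], size=6))
    print(pad_or_truncate(list(range(10)), size=4))
```

Truncation silently drops the end of a long sentence, which is the reason to export with a larger --onnx-insize when long inputs are expected.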

Dataset Preparation

Choose a dataset folder: e.g. <data_folder> = /data/tts - the directory where the dataset will be stored.

Download LJSpeech:

 cd <data_folder>
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
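
Where wget or tar is not available, the same download and extraction can be done from Python with the standard library. This is a sketch under the assumption that the URL above is still reachable; `download_ljspeech` and `extract_bz2_archive` are illustrative names, not part of the repository.

```python
import tarfile
import urllib.request
from pathlib import Path

LJSPEECH_URL = "https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2"


def extract_bz2_archive(archive: Path, dest: Path) -> None:
    """Unpack a bzip2-compressed tarball into `dest`."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(dest)  # only use on archives from a trusted source


def download_ljspeech(data_folder: str) -> Path:
    """Download LJSpeech into `data_folder` (skipped if present) and unpack it."""
    folder = Path(data_folder)
    folder.mkdir(parents=True, exist_ok=True)
    archive = folder / "LJSpeech-1.1.tar.bz2"
    if not archive.exists():
        urllib.request.urlretrieve(LJSPEECH_URL, archive)
    extract_bz2_archive(archive, folder)
    return folder / "LJSpeech-1.1"
```

Calling `download_ljspeech("/data/tts")` would leave the corpus in /data/tts/LJSpeech-1.1, matching the paths used in the configuration file.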

Prepare the dataset: <parent_folder> is the directory where efficientspeech was git cloned.

 cd <parent_folder>/efficientspeech

Edit config/LJSpeech/preprocess.yaml:

 >>>>>>>>>>>>>>>>>
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
>>>>>>>>>>>>>>>>

Replace /data/tts with your <data_folder>.
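
The substitution can also be scripted. The sketch below performs a plain text replacement so that it needs no YAML library; `point_config_at` and `update_config` are illustrative names, and the approach assumes /data/tts appears in the file only as the dataset prefix.

```python
from pathlib import Path


def point_config_at(config_text: str, data_folder: str,
                    default: str = "/data/tts") -> str:
    """Replace the default dataset prefix with `data_folder` in config text."""
    return config_text.replace(default, data_folder.rstrip("/"))


def update_config(config_path: str, data_folder: str) -> None:
    """Rewrite the preprocessing config in place with the new dataset prefix."""
    path = Path(config_path)
    path.write_text(point_config_at(path.read_text(), data_folder))


if __name__ == "__main__":
    sample = 'corpus_path: "/data/tts/LJSpeech-1.1"\n'
    print(point_config_at(sample, "/mnt/speech"), end="")
```

For example, `update_config("config/LJSpeech/preprocess.yaml", "/mnt/speech")` would rewrite both corpus_path and raw_path in one pass.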

Download the alignment data to preprocessed_data/LJSpeech/TextGrid from here.

Prepare the dataset:

 python3 prepare_align.py config/LJSpeech/preprocess.yaml

This will take an hour or so.

For more information, see the FastSpeech2 implementation used to prepare the dataset.

Train

Tiny ES

By default:

--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
--accelerator=gpu
--infer-device=cuda
--devices=1
See more options in utils/tools.py.

 python3 train.py

Small ES

 python3 train.py --n-blocks 3 --reduction 2

Base ES

 python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3

Comparison with other SOTA Neural TTS

ES vs FS2 vs PortaSpeech vs LightSpeech

Credits

FastSpeech2 unofficial GitHub implementation.

Citation

If you find this work useful, please cite:

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

Additional Information

Version: efficientspeech-0.2.1
Type: AI source code
Last updated: 2025-08-21
Size: 4.85 MB
Source: GitHub
