EfficientSpeech: An On-Device Text to Speech Model

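EfficientSpeech, or ES for short, is a text-to-speech (TTS) model. It generates mel spectrograms at a speed of 104 (mRTF), that is, 104 seconds of speech per second, on an RPi4. Its tiny version has a footprint of just 266k parameters. Generating 6 seconds of speech consumes only 90 MFLOPs.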
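To make the speed figure concrete, here is a minimal sketch that converts a real-time factor into wall-clock synthesis time. It takes RTF to mean seconds of speech produced per second of compute, as described above. The function name `synthesis_time` is illustrative and not part of the repository, and the figure covers mel spectrogram generation only.

```python
def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to generate `audio_seconds` of speech.

    Assumes RTF = seconds of speech generated per second of compute.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


if __name__ == "__main__":
    # 6 s of speech at the quoted mel-generation speed of 104 on an RPi4
    print(f"{synthesis_time(6.0, 104.0) * 1000:.1f} ms")  # prints "57.7 ms"
```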

Paper

IEEE Xplore
arXiv

Model Architecture

EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is done by a transposed depth-wise separable convolution.
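One reason a depth-wise separable layer suits a small model is its parameter count. The sketch below compares a standard transposed convolution with a depth-wise transposed convolution followed by a 1x1 point-wise convolution. It assumes 1-D layers, and the channel and kernel sizes are hypothetical values chosen for illustration; this README does not state the layer sizes EfficientSpeech actually uses.

```python
def standard_transposed_conv_params(c_in: int, c_out: int, kernel: int,
                                    bias: bool = True) -> int:
    """Weights of a dense 1-D transposed convolution (groups=1)."""
    return c_in * c_out * kernel + (c_out if bias else 0)


def separable_transposed_conv_params(c_in: int, c_out: int, kernel: int,
                                     bias: bool = True) -> int:
    """Depth-wise transposed conv (one filter per channel) + 1x1 point-wise conv."""
    depthwise = c_in * kernel + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical sizes, not taken from the EfficientSpeech code
    c_in, c_out, kernel = 128, 64, 3
    print(standard_transposed_conv_params(c_in, c_out, kernel))   # 24640
    print(separable_transposed_conv_params(c_in, c_out, kernel))  # 8768
```

With these example sizes the separable variant needs roughly a third of the weights, and the gap widens as the kernel grows.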

Quick Demo

Install

ES is currently migrating to PyTorch 2.0 and Lightning 2.0. Expect unstable features.

 pip install -r requirements.txt

If you encounter problems with cuBLAS:

 pip uninstall nvidia_cublas_cu11

Tiny ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav

The output file is under outputs. Play the WAV file:

 ffplay outputs/fox.wav

After the weights are downloaded, they can be reused:

 python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav

Playback:

 ffplay outputs/color.wav

Small ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
  --infer-device cpu --n-blocks 3 --reduction 2 \
  --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
  --wav-filename bees.wav

Playback:

 ffplay outputs/bees.wav

Base ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav

Playback:

 ffplay outputs/bees-base.wav

GPU for Inference

And with a long text. On an A100, this can reach RTF > 1,300. Time it using the --iter 100 option.

 python3 demo.py --checkpoint small_eng_952k.ckpt \
  --infer-device cuda --n-blocks 3 --reduction 2 \
  --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
  --wav-filename cats.wav --iter 100
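Timing over many iterations smooths out warm-up and scheduling noise. The sketch below shows one way such an averaged RTF could be computed. It is not how demo.py implements --iter: `measure_rtf` and the `synthesize` callable are hypothetical names, and `synthesize` is assumed to run one inference and return the duration of the audio it produced.

```python
import time
from typing import Callable


def measure_rtf(synthesize: Callable[[], float], iters: int = 100) -> float:
    """Average real-time factor (audio seconds per compute second) over `iters` runs.

    `synthesize` is a hypothetical callable: it runs one inference and
    returns the length, in seconds, of the audio it generated.
    """
    if iters < 1:
        raise ValueError("iters must be at least 1")
    audio_total = 0.0
    start = time.perf_counter()
    for _ in range(iters):
        audio_total += synthesize()
    elapsed = time.perf_counter() - start
    return audio_total / elapsed


if __name__ == "__main__":
    def fake_synthesize() -> float:
        time.sleep(0.001)  # stand-in for real model work
        return 1.0         # pretend one second of audio was produced

    print(f"RTF ~ {measure_rtf(fake_synthesize, iters=10):.0f}")
```

On a GPU, kernels run asynchronously, so a real measurement would also need to wait for the device to finish before reading the clock.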

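Compile and Number of Threads Options

Compilation is supported with --compile during training or inference. For training, eager mode is faster. Training the tiny version takes ~17 hours on an A100. For inference, the compiled version is faster. For an unknown reason, the compile option generates errors when --infer-device cuda is used.

By default, PyTorch 2.0 uses 128 CPU threads (on AMD; 4 on an RPi4), which causes a slowdown during inference. During inference, it is recommended to set it to a lower number. For example: --threads 24.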
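As an illustration of capping the thread count, the sketch below picks a value no larger than the number of available cores and applies it through `torch.set_num_threads`, which is a real PyTorch call. Whether --threads is wired this way inside demo.py is an assumption; `pick_num_threads` and `apply_num_threads` are hypothetical helper names.

```python
import os
from typing import Optional


def pick_num_threads(requested: Optional[int] = None, cap: int = 24) -> int:
    """Choose a thread count: the requested value (or `cap`), limited to the cores present."""
    available = os.cpu_count() or 1
    wanted = requested if requested is not None else cap
    return max(1, min(wanted, available))


def apply_num_threads(n: int) -> bool:
    """Apply the limit if PyTorch is installed; return True when it was applied."""
    try:
        import torch
    except ImportError:
        return False
    torch.set_num_threads(n)  # limits intra-op CPU parallelism
    return True


if __name__ == "__main__":
    n = pick_num_threads()
    print(n, apply_num_threads(n))
```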

RPi4 Inference

PyTorch 2.0 is slower on RPi4. Please use the Demo Release and the ICASSP2023 model weights.

RTF on PyTorch 2.0 is ~1.0. RTF on PyTorch 1.12 is ~1.7.

Alternatively, use the ONNX version:

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav

ONNX

Only a fixed input phoneme length is supported. Padding or truncation is applied if needed. Modify it using --onnx-insize=<desired value>. The default max phoneme length is 128. For example:

 python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
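The padding and truncation rule for fixed-length input can be sketched as follows. This is an illustration only, not code from the repository: `pad_or_truncate` is a hypothetical helper, and using 0 as the padding id is an assumption.

```python
from typing import List


def pad_or_truncate(phoneme_ids: List[int], insize: int = 128,
                    pad_id: int = 0) -> List[int]:
    """Return exactly `insize` ids: cut long inputs, right-pad short ones.

    `pad_id=0` is an assumed padding symbol, not confirmed by the source.
    """
    if insize < 1:
        raise ValueError("insize must be at least 1")
    if len(phoneme_ids) >= insize:
        return list(phoneme_ids[:insize])
    return list(phoneme_ids) + [pad_id] * (insize - len(phoneme_ids))


if __name__ == "__main__":
    print(pad_or_truncate([5, 9, 2], insize=5))         # [5, 9, 2, 0, 0]
    print(len(pad_or_truncate(list(range(300)))))       # 128
```

Note that truncation silently drops the tail of a long sentence, which is why a larger --onnx-insize may be preferable for long inputs.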

Dataset Preparation

Choose a dataset folder: e.g. <data_folder> = /data/tts, the directory where the dataset will be stored.

Download LJSpeech:

 cd <data_folder>
 wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
 tar xjvf LJSpeech-1.1.tar.bz2

Prepare the dataset: <parent_folder> is where efficientspeech was git cloned.

 cd <parent_folder>/efficientspeech

Edit config/LJSpeech/preprocess.yaml:

 >>>>>>>>>>>>>>>>>
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
>>>>>>>>>>>>>>>>

Replace /data/tts with your <data_folder>.
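If you prefer to script this edit, a minimal sketch is shown below. It uses plain text replacement to avoid a YAML dependency and assumes the file still contains the default /data/tts prefix; `retarget_config` and `retarget_config_file` are hypothetical helpers, not part of the repository.

```python
from pathlib import Path

DEFAULT_DATA_FOLDER = "/data/tts"  # default prefix shown in preprocess.yaml


def retarget_config(text: str, data_folder: str,
                    default: str = DEFAULT_DATA_FOLDER) -> str:
    """Swap the default data folder prefix for `data_folder` in the config text."""
    return text.replace(default, data_folder.rstrip("/"))


def retarget_config_file(path: str, data_folder: str) -> None:
    """Rewrite the config file in place with the new data folder."""
    p = Path(path)
    p.write_text(retarget_config(p.read_text(encoding="utf-8"), data_folder),
                 encoding="utf-8")


if __name__ == "__main__":
    sample = 'corpus_path: "/data/tts/LJSpeech-1.1"'
    print(retarget_config(sample, "/mnt/datasets"))
```

Relative entries such as lexicon_path and preprocessed_path are left untouched because they do not contain the default prefix.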

Download the alignment data to preprocessed_data/LJSpeech/TextGrid from here.

Prepare the dataset:

 python3 prepare_align.py config/LJSpeech/preprocess.yaml

This will take an hour or so.

For more info: FastSpeech2 implementation to prepare the dataset.

Train

Tiny ES

By default:

--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
--accelerator=gpu
--infer-device=cuda
--devices=1
See more options in utils/tools.py

 python3 train.py

Small ES

 python3 train.py --n-blocks 3 --reduction 2

Base ES

 python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3
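The three variants differ only in the flags passed to train.py. As a convenience, the sketch below collects the flag sets listed above in one table and builds the matching command line. `VARIANT_FLAGS` and `train_command` are hypothetical names for illustration; the flag values themselves are copied from the commands in this section.

```python
import shlex
from typing import Dict, List, Sequence

# Flag sets copied from the training commands above
VARIANT_FLAGS: Dict[str, List[str]] = {
    "tiny": [],
    "small": ["--n-blocks", "3", "--reduction", "2"],
    "base": ["--head", "2", "--reduction", "1", "--expansion", "2",
             "--kernel-size", "5", "--n-blocks", "3", "--block-depth", "3"],
}


def train_command(variant: str, extra: Sequence[str] = ()) -> List[str]:
    """Build the argv list for training the given variant."""
    if variant not in VARIANT_FLAGS:
        raise ValueError(f"unknown variant: {variant!r}")
    return ["python3", "train.py", *VARIANT_FLAGS[variant], *extra]


if __name__ == "__main__":
    for name in VARIANT_FLAGS:
        print(shlex.join(train_command(name)))
```

The resulting list can be handed to `subprocess.run` from the repository root, with further options such as `["--precision", "32"]` passed through `extra`.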

Comparison with other SOTA Neural TTS

ES vs FS2 vs PortaSpeech vs LightSpeech

Credits

FastSpeech2 Unofficial Github

Citation

If you find this work useful, please cite:

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
