efficientspeech

efficientspeech-0.2.1

EfficientSpeech: An On-Device Text to Speech Model

EfficientSpeech, or ES for short, is an efficient neural text to speech (TTS) model. It generates mel spectrograms at a mel real-time factor (mRTF) of 104, that is, 104 seconds of speech per second of compute, on an RPi4. Its tiny version has a footprint of only 266k parameters, about 1% of a modern TTS model such as MixerTTS. Generating 6 seconds of speech consumes only 90 MFLOPS.
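
To make the speed figure concrete, the sketch below converts a real-time factor into synthesis time. It assumes the mRTF is read as seconds of speech produced per second of compute, as stated above. The function name `synthesis_time` is illustrative and not part of the repository.

```python
def synthesis_time(speech_seconds: float, rtf: float) -> float:
    """Seconds of compute needed to generate `speech_seconds` of audio.

    Assumes RTF = seconds of speech generated per second of compute.
    """
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return speech_seconds / rtf


if __name__ == "__main__":
    # 6 s of speech at the reported mRTF of 104 on an RPi4
    print(f"{synthesis_time(6.0, 104.0) * 1000:.1f} ms")  # prints "57.7 ms"
```

Under that reading, the 6-second utterance mentioned above would need roughly 58 ms of mel-spectrogram generation time on an RPi4.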

Paper

IEEE Xplore
arXiv

Model Architecture

EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is done by a transposed depth-wise separable convolution.
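
A depth-wise separable convolution is one reason such a model can stay small. The sketch below compares parameter counts for a standard 1D convolution and a depth-wise separable one. The channel width (128) and kernel size (5) are hypothetical values chosen for illustration, not taken from the EfficientSpeech code; the same weight counts apply to the transposed variants.

```python
def conv1d_params(c_in: int, c_out: int, kernel_size: int, bias: bool = True) -> int:
    """Parameter count of a standard 1D convolution."""
    return c_in * c_out * kernel_size + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, kernel_size: int,
                               bias: bool = True) -> int:
    """Parameter count of a depth-wise conv followed by a 1x1 point-wise conv."""
    # Depth-wise: one kernel per input channel (groups == c_in).
    depthwise = c_in * kernel_size + (c_in if bias else 0)
    # Point-wise: a 1x1 convolution that mixes channels.
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


if __name__ == "__main__":
    # Hypothetical layer sizes, for illustration only.
    print(conv1d_params(128, 128, 5))               # 82048
    print(depthwise_separable_params(128, 128, 5))  # 17280
```

For these illustrative sizes the separable form needs roughly a fifth of the parameters of the standard convolution.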

Quick Demo

Install

ES is currently migrating to PyTorch 2.0 and Lightning 2.0. Expect unstable features.

 pip install -r requirements.txt

If you encounter problems with cuBLAS:

 pip uninstall nvidia_cublas_cu11

Tiny ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
   --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav

The output file is written under outputs. Play the WAV file:

 ffplay outputs/fox.wav

After the weights have been downloaded, they can be reused:

 python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
   --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
   --wav-filename color.wav

Playback:

 ffplay outputs/color.wav

Small ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
   --infer-device cpu --n-blocks 3 --reduction 2 \
   --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
   --wav-filename bees.wav

Playback:

 ffplay outputs/bees.wav

Base ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
   --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
   --text "Why do bees have sticky hair?" --wav-filename bees-base.wav

Playback:

 ffplay outputs/bees-base.wav
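
The three demo commands differ only in the checkpoint and a handful of architecture flags. The sketch below collects them in one place and builds the argument list for demo.py. The flags and checkpoint names are the ones shown above; the `VARIANTS` table and `demo_command` helper are hypothetical conveniences, not part of the repository.

```python
import shlex

RELEASE_URL = ("https://github.com/roatienza/efficientspeech/"
               "releases/download/pytorch2.0.1")

# Checkpoint file and extra architecture flags per variant, as used above.
VARIANTS = {
    "tiny": ("tiny_eng_266k.ckpt", []),
    "small": ("small_eng_952k.ckpt", ["--n-blocks", "3", "--reduction", "2"]),
    "base": ("base_eng_4M.ckpt",
             ["--head", "2", "--reduction", "1", "--expansion", "2",
              "--kernel-size", "5", "--n-blocks", "3", "--block-depth", "3"]),
}


def demo_command(variant: str, text: str, wav_filename: str,
                 device: str = "cpu") -> list[str]:
    """Build the demo.py argument list for one model variant."""
    checkpoint, extra_flags = VARIANTS[variant]
    return ["python3", "demo.py",
            "--checkpoint", f"{RELEASE_URL}/{checkpoint}",
            *extra_flags,
            "--infer-device", device,
            "--text", text,
            "--wav-filename", wav_filename]


if __name__ == "__main__":
    cmd = demo_command("small", "Why do bees have sticky hair?", "bees.wav")
    print(shlex.join(cmd))  # shell-quoted, ready to paste into a terminal
```

Passing the list to `subprocess.run(cmd, check=True)` from the repository root would run the demo without any shell quoting concerns.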

GPU for Inference

A GPU pays off with long text. On an A100, this can reach RTF > 1,300. Use the --iter 100 option to time it.

 python3 demo.py --checkpoint small_eng_952k.ckpt \
   --infer-device cuda --n-blocks 3 --reduction 2 \
   --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
   --wav-filename cats.wav --iter 100
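
Timing over many iterations smooths out one-off costs such as warm-up. The sketch below shows one way such a measurement could be made; it is an assumption about the general approach, not a description of what demo.py does internally. `measure_rtf` and the stand-in `fake_synthesize` are hypothetical names.

```python
import time
from typing import Callable


def measure_rtf(synthesize: Callable[[str], float], text: str,
                iterations: int = 100) -> float:
    """Average real-time factor over repeated synthesis runs.

    `synthesize(text)` must return the duration, in seconds, of the
    audio it generated.
    """
    speech_seconds = 0.0
    start = time.perf_counter()
    for _ in range(iterations):
        speech_seconds += synthesize(text)
    elapsed = time.perf_counter() - start
    return speech_seconds / elapsed


def fake_synthesize(text: str) -> float:
    """Stand-in for a real model: pretends each word is 0.3 s of speech."""
    time.sleep(0.001)  # simulated compute time
    return 0.3 * len(text.split())


if __name__ == "__main__":
    rtf = measure_rtf(fake_synthesize, "the quick brown fox", iterations=20)
    print(f"RTF ~ {rtf:.0f}")
```

On a GPU, a real measurement would also need to wait for queued kernels to finish before reading the clock, otherwise the elapsed time is underestimated.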

Compile and Number of Threads Options

Compilation is supported with --compile during training or inference. For training, eager mode is faster; training the tiny version takes ~17 hours on an A100. For inference, the compiled version is faster. For reasons not yet known, the compile option produces errors when used with --infer-device cuda.

By default, PyTorch 2.0 uses 128 CPU threads (on AMD; 4 on an RPi4), which slows down inference. During inference, it is recommended to set this to a lower number. For example: --threads 24
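
A value for --threads can be derived from the machine instead of hard-coded. The sketch below caps the thread count at 24, the example value given above, and never exceeds the number of CPUs reported by the operating system. `pick_threads` is a hypothetical helper, and the cap of 24 is only the document's example, not a tuned recommendation.

```python
import os
from typing import Optional


def pick_threads(cap: int = 24, available: Optional[int] = None) -> int:
    """Return a thread count no larger than `cap` or the CPU count."""
    if available is None:
        # os.cpu_count() may return None on unusual platforms.
        available = os.cpu_count() or 1
    return max(1, min(cap, available))


if __name__ == "__main__":
    print(f"--threads {pick_threads()}")
```

In PyTorch code, the same limit can be applied directly with `torch.set_num_threads`.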

RPi4 Inference

PyTorch 2.0 is slower on the RPi4. Please use the Demo Release and the ICASSP2023 model weights.

RTF on PyTorch 2.0 is ~1.0. RTF on PyTorch 1.12 is ~1.7.

Alternatively, use the ONNX version:

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
   --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav

ONNX

Only a fixed input phoneme length is supported. Padding or truncation is applied if needed. Change the length with --onnx-insize=<desired value>. The default maximum phoneme length is 128. For example:

 python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
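
The fixed-length constraint means every phoneme sequence has to be brought to exactly the exported size. The sketch below shows the padding and truncation logic in its simplest form. `fit_phonemes` is a hypothetical helper, and the padding id of 0 is an assumption; the repository may use a different padding symbol.

```python
from typing import Sequence


def fit_phonemes(ids: Sequence[int], size: int = 128, pad_id: int = 0) -> list[int]:
    """Pad or truncate a phoneme id sequence to exactly `size` entries."""
    fitted = list(ids)[:size]                      # truncate if too long
    fitted += [pad_id] * (size - len(fitted))      # pad if too short
    return fitted


if __name__ == "__main__":
    short = fit_phonemes([5, 9, 2], size=8)
    long_ = fit_phonemes(range(1, 300), size=8)
    print(short)  # [5, 9, 2, 0, 0, 0, 0, 0]
    print(long_)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

Truncation silently drops the tail of a long sentence, so a larger --onnx-insize is the safer choice when long inputs are expected.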

Dataset Preparation

Choose a dataset folder, e.g. <data_folder> = /data/tts, the directory where the dataset will be stored.

Download LJSpeech:

 cd <data_folder>
 wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
 tar xjvf LJSpeech-1.1.tar.bz2
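
The LJSpeech archive is bzip2-compressed, so it has to be opened with a bzip2-aware reader. As an alternative to the shell, the sketch below extracts such an archive with Python's standard tarfile module. `extract_bz2` is a hypothetical helper, and the paths in the usage note are the example values from this section.

```python
import tarfile
from pathlib import Path


def extract_bz2(archive: str, dest: str) -> list[str]:
    """Extract a .tar.bz2 archive into `dest` and list its top-level entries."""
    dest_path = Path(dest)
    dest_path.mkdir(parents=True, exist_ok=True)
    # "r:bz2" selects bzip2 decompression explicitly.
    with tarfile.open(archive, "r:bz2") as tar:
        tar.extractall(dest_path)
    return sorted(p.name for p in dest_path.iterdir())
```

For example, `extract_bz2("/data/tts/LJSpeech-1.1.tar.bz2", "/data/tts")` would unpack the downloaded corpus next to the archive. Only extract archives from sources you trust.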

Prepare the dataset. <parent_folder> is the directory where EfficientSpeech was git cloned.

 cd <parent_folder>/efficientspeech

Edit config/LJSpeech/preprocess.yaml:

 >>>>>>>>>>>>>>>>>
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
>>>>>>>>>>>>>>>>

Replace /data/tts with your <data_folder>.
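
The substitution touches two of the four path entries. The sketch below derives all four from a chosen data folder and renders the `path:` block shown above. The keys and relative paths are copied from the config excerpt; `preprocess_paths` and `render_path_block` are hypothetical helpers, and the output is meant to be pasted into the file by hand rather than to replace it.

```python
from pathlib import PurePosixPath


def preprocess_paths(data_folder: str) -> dict[str, str]:
    """Path entries for preprocess.yaml, given the dataset folder."""
    corpus = PurePosixPath(data_folder) / "LJSpeech-1.1"
    return {
        "corpus_path": str(corpus),
        "lexicon_path": "lexicon/librispeech-lexicon.txt",
        "raw_path": str(corpus / "wavs"),
        "preprocessed_path": "./preprocessed_data/LJSpeech",
    }


def render_path_block(paths: dict[str, str]) -> str:
    """Render the entries as the YAML `path:` block used in the config."""
    lines = ["path:"] + [f'  {key}: "{value}"' for key, value in paths.items()]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_path_block(preprocess_paths("/mnt/datasets/tts")))
```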

Download the alignment data to preprocessed_data/LJSpeech/TextGrid from here.

Prepare the dataset:

 python3 prepare_align.py config/LJSpeech/preprocess.yaml

This will take an hour or so.

For more information, see the FastSpeech2 implementation used to prepare the dataset.

Train

Tiny ES

By default:

--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
--accelerator=gpu
--infer-device=cuda
--devices=1
See more options in utils/tools.py.

 python3 train.py

Small ES

 python3 train.py --n-blocks 3 --reduction 2

Base ES

 python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3

Comparison with other SOTA Neural TTS

[Figure: ES vs FS2 vs PortaSpeech vs LightSpeech]

Credits

Unofficial FastSpeech2 GitHub repository

Citation

If you find this work useful, please cite:

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}

More information

Version: efficientspeech-0.2.1
Type: AI source code
Updated: 2025-08-21
Size: 4.85MB
Source: GitHub
