efficientspeech-0.2.1

EfficientSpeech: An On-Device Text to Speech Model

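EfficientSpeech, or ES for short, is an efficient neural text-to-speech (TTS) model. It generates mel spectrograms at a speed of 104 (mRTF), i.e. 104 seconds of speech per second, on an RPi4. Its tiny version has a footprint of just 266k parameters, only about 1% of that of a modern TTS model such as MixerTTS. Generating 6 seconds of speech consumes only 90 MFLOPS.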

Paper

IEEE Xplore
arXiv

Model Architecture

EfficientSpeech is a shallow (2 blocks!) pyramid transformer resembling a U-Net. Upsampling is done by a transposed depth-wise separable convolution.
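
To make the upsampling step concrete, here is a minimal plain-Python sketch of a 1-D transposed depth-wise separable convolution. It is an illustration under assumptions, not the repository's code: the function names, the depth-wise-then-point-wise order, the stride of 2, and the absence of padding and bias are all choices made here for clarity.

```python
def depthwise_transposed_conv1d(x, kernels, stride):
    """Upsample each channel independently (depth-wise: no channel mixing).

    x: list of channels, each a list of floats of length T.
    kernels: one 1-D kernel of length K per channel.
    Each output channel has length (T - 1) * stride + K.
    """
    out = []
    for channel, kernel in zip(x, kernels):
        y = [0.0] * ((len(channel) - 1) * stride + len(kernel))
        for t, value in enumerate(channel):
            # Every input sample scatters a scaled copy of the kernel.
            for j, weight in enumerate(kernel):
                y[t * stride + j] += value * weight
        out.append(y)
    return out


def pointwise_conv1d(x, weights):
    """1x1 convolution: mixes channels at each time step.

    weights[o][i] maps input channel i to output channel o.
    """
    length = len(x[0])
    return [
        [sum(w * x[i][t] for i, w in enumerate(row)) for t in range(length)]
        for row in weights
    ]


def upsample(x, kernels, weights, stride=2):
    """Depth-wise transposed convolution followed by a point-wise convolution."""
    return pointwise_conv1d(depthwise_transposed_conv1d(x, kernels, stride), weights)
```

Splitting the operation this way is what keeps the parameter count low: the depth-wise step needs one small kernel per channel, and only the point-wise step has a weight per channel pair.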

Quick Demo

Install

ES is currently migrating to PyTorch 2.0 and Lightning 2.0. Expect unstable features.

 pip install -r requirements.txt

If you encounter problems with cuBLAS:

 pip uninstall nvidia_cublas_cu11

Tiny ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.ckpt \
  --infer-device cpu --text "the quick brown fox jumps over the lazy dog" --wav-filename fox.wav

The output file is written under outputs. Play the WAV file:

 ffplay outputs/fox.wav

After the weights have been downloaded, they can be reused:

 python3 demo.py --checkpoint tiny_eng_266k.ckpt --infer-device cpu \
  --text "In additive color mixing, which is used for displays such as computer screens and televisions, the primary colors are red, green, and blue." \
  --wav-filename color.wav

Playback:

 ffplay outputs/color.wav

Small ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/small_eng_952k.ckpt \
  --infer-device cpu --n-blocks 3 --reduction 2 \
  --text "Bees are essential pollinators responsible for fertilizing plants and facilitating the growth of fruits, vegetables, and flowers. Their sophisticated social structures and intricate communication systems make them fascinating and invaluable contributors to ecosystems worldwide." \
  --wav-filename bees.wav

Playback:

 ffplay outputs/bees.wav

Base ES

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/base_eng_4M.ckpt \
  --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3 --infer-device cpu \
  --text "Why do bees have sticky hair?" --wav-filename bees-base.wav

Playback:

 ffplay outputs/bees-base.wav

GPU for Inference

ES also runs on a GPU, here with a long text. On an A100, inference can reach RTF > 1,300. Time it with the --iter 100 option.

 python3 demo.py --checkpoint small_eng_952k.ckpt \
  --infer-device cuda --n-blocks 3 --reduction 2 \
  --text "Once upon a time, in a magical forest filled with colorful flowers and sparkling streams, there lived a group of adorable kittens. Their names were Fluffy, Sparkle, and Whiskers. With their soft fur and twinkling eyes, they charmed everyone they met. Every day, they would play together, chasing their tails and pouncing on sunbeams that danced through the trees. Their purrs filled the forest with joy, and all the woodland creatures couldn't help but smile whenever they saw the cute trio. The animals knew that these kittens were truly the epitome of cuteness, bringing happiness wherever they went." \
  --wav-filename cats.wav --iter 100
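
As a rough sketch of what such a timing run measures, the snippet below computes RTF as seconds of speech produced per second of wall-clock time, which is the convention this README uses (higher is better; some papers report the inverse ratio). The `synthesize` callable is hypothetical and stands in for one inference pass; how demo.py times its own iterations is not shown here.

```python
import time


def real_time_factor(audio_seconds, elapsed_seconds):
    """Seconds of speech produced per second of wall-clock compute."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return audio_seconds / elapsed_seconds


def mean_rtf(synthesize, iterations=100):
    """Average RTF over several runs.

    `synthesize` is a hypothetical zero-argument callable that runs one
    inference pass and returns the duration, in seconds, of the audio
    it produced.
    """
    total_audio = 0.0
    start = time.perf_counter()
    for _ in range(iterations):
        total_audio += synthesize()
    elapsed = time.perf_counter() - start
    return real_time_factor(total_audio, elapsed)
```

Averaging over many iterations smooths out warm-up effects such as lazy initialization on the first call.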

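Compile Option and Number of Threads

Compilation is supported through the --compile option during training or inference. For training, eager mode is faster; training the tiny version takes about 17 hours on an A100. For inference, the compiled version is faster. For an unknown reason, the compile option produces errors when --infer-device cuda is used.

By default, PyTorch 2.0 uses 128 CPU threads (AMD; 4 on an RPi4), which slows down inference. Set a lower number during inference, for example --threads 24.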
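
One way to choose a value for --threads is to cap the machine's CPU count, as in the sketch below. The helper names are hypothetical, and the cap of 24 is simply the example value from the text above; the best setting for a given machine has to be measured.

```python
import os


def pick_thread_count(cap=24):
    """Return a thread count no larger than `cap` or the machine's CPU count."""
    available = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, min(cap, available))


def threads_args(cap=24):
    """Build the command-line fragment, e.g. ['--threads', '4'] on a 4-core board."""
    return ["--threads", str(pick_thread_count(cap))]
```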

RPi4 Inference

PyTorch 2.0 is slower on the RPi4. Use the Demo Release and the ICASSP2023 model weights instead.

RTF on PyTorch 2.0 is ~1.0. RTF on PyTorch 1.12 is ~1.7.

Alternatively, use the ONNX version:

 python3 demo.py --checkpoint https://github.com/roatienza/efficientspeech/releases/download/pytorch2.0.1/tiny_eng_266k.onnx \
  --infer-device cpu --text "the primary colors are red, green, and blue." --wav-filename primary.wav

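ONNX

The ONNX version supports only a fixed input phoneme length. Padding or truncation is applied if needed. Change the length with --onnx-insize=<desired value>. The default maximum phoneme length is 128. For example: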

 python3 convert.py --checkpoint tiny_eng_266k.ckpt --onnx tiny_eng_266k.onnx --onnx-insize 256
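
The padding-or-truncation rule can be sketched as follows. This is an illustration, not the repository's code: the function name is made up, and `pad_id=0` is an assumption, since the real padding symbol depends on how the exported model was trained.

```python
def pad_or_truncate(phoneme_ids, size=128, pad_id=0):
    """Return a list of exactly `size` phoneme ids.

    `pad_id=0` is an assumption made for this sketch.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if len(phoneme_ids) >= size:
        # Truncation silently drops the trailing phonemes.
        return list(phoneme_ids[:size])
    return list(phoneme_ids) + [pad_id] * (size - len(phoneme_ids))
```

Because truncation drops the end of the input, a larger --onnx-insize is the safer choice when long sentences are expected, at the cost of more computation on short ones.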

Dataset Preparation

Choose a dataset folder, e.g. <data_folder> = /data/tts, the directory where the dataset will be stored.

Download LJSpeech:

 cd <data_folder>
 wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
 tar xjvf LJSpeech-1.1.tar.bz2
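
If the setup is scripted from Python rather than the shell, the extraction step could look like the sketch below. The function name is hypothetical; the only point it illustrates is that a .tar.bz2 archive is bzip2-compressed and has to be opened in bz2 mode.

```python
import tarfile
from pathlib import Path


def extract_archive(archive_path, data_folder):
    """Extract a .tar.bz2 archive into data_folder and return that folder."""
    target = Path(data_folder)
    target.mkdir(parents=True, exist_ok=True)
    # "r:bz2" selects bzip2 decompression explicitly.
    with tarfile.open(archive_path, mode="r:bz2") as archive:
        archive.extractall(target)
    return target
```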

Prepare the dataset. <parent_folder> is the directory where efficientspeech was git cloned.

 cd <parent_folder>/efficientspeech

Edit config/LJSpeech/preprocess.yaml:

 >>>>>>>>>>>>>>>>>
path:
  corpus_path: "/data/tts/LJSpeech-1.1"
  lexicon_path: "lexicon/librispeech-lexicon.txt"
  raw_path: "/data/tts/LJSpeech-1.1/wavs"
  preprocessed_path: "./preprocessed_data/LJSpeech"
>>>>>>>>>>>>>>>>

Replace /data/tts with your <data_folder>.
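
The substitution can also be done programmatically. The sketch below works on the config text as a plain string, so it needs no YAML library; the function name is hypothetical, and it assumes /data/tts appears in the file only as the dataset root.

```python
def retarget_paths(yaml_text, data_folder, default_root="/data/tts"):
    """Replace the default dataset root with `data_folder` in the config text."""
    root = data_folder.rstrip("/")  # avoid a doubled slash in the result
    return yaml_text.replace(default_root, root)
```

A typical use would be to read config/LJSpeech/preprocess.yaml, pass its text through `retarget_paths`, and write the result back. The relative entries (`lexicon_path`, `preprocessed_path`) are left untouched.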

Download the alignment data into preprocessed_data/LJSpeech/TextGrid.

Prepare the dataset:

 python3 prepare_align.py config/LJSpeech/preprocess.yaml

This takes an hour or so.

For more information, see the FastSpeech2 implementation used to prepare the dataset.

Train

Tiny ES

By default:

--precision=16. Other options: "bf16-mixed", "16-mixed", 16, 32, 64.
--accelerator=gpu
--infer-device=cuda
--devices=1
See utils/tools.py for more options.

 python3 train.py

Small ES

 python3 train.py --n-blocks 3 --reduction 2

Base ES

 python3 train.py --head 2 --reduction 1 --expansion 2 --kernel-size 5 --n-blocks 3 --block-depth 3
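
The three model sizes differ only in their architecture flags, which must match between training and inference. The sketch below collects the flags from the commands in this README into one table; `PRESETS` and `build_command` are hypothetical names, not part of the repository.

```python
# Architecture flags copied from the commands in this README.
# The tiny model uses the defaults, so it needs no extra flags.
PRESETS = {
    "tiny": [],
    "small": ["--n-blocks", "3", "--reduction", "2"],
    "base": [
        "--head", "2", "--reduction", "1", "--expansion", "2",
        "--kernel-size", "5", "--n-blocks", "3", "--block-depth", "3",
    ],
}


def build_command(script, size, extra=()):
    """Return the argument list for train.py or demo.py at a given model size."""
    if size not in PRESETS:
        raise ValueError(f"unknown model size: {size!r}")
    return ["python3", script, *PRESETS[size], *extra]
```

For example, `subprocess.run(build_command("train.py", "small"), check=True)` would launch the small-model training run shown above.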

Comparison with other SOTA Neural TTS

ES vs FS2 vs PortaSpeech vs LightSpeech

Credits

FastSpeech2 unofficial GitHub implementation.

Citation

If you find this work useful, please cite:

 @inproceedings{atienza2023efficientspeech,
  title={EfficientSpeech: An On-Device Text to Speech Model},
  author={Atienza, Rowel},
  booktitle={ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={1--5},
  year={2023},
  organization={IEEE}
}
