TensorFlowTTS

Real-Time State-of-the-art Speech Synthesis for TensorFlow 2
TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training/inference, optimize further by using fake-quantize-aware training and pruning, and make TTS models run faster than real-time and be deployable on mobile devices or embedded systems.
This repository is tested on Ubuntu 18.04 with:
Different TensorFlow versions should work but have not been tested. This repository will try to follow the latest stable TensorFlow. We recommend installing TensorFlow 2.6.0 if you want to use multi-GPU training.
$ pip install TensorFlowTTS
Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source as below.
$ git clone https://github.com/TensorSpeech/TensorFlowTTS.git
$ cd TensorFlowTTS
$ pip install .
If you want to upgrade the repository and its dependencies:
$ git pull
$ pip install --upgrade .
TensorFlowTTS currently provides the following architectures:
We are also implementing some techniques from the following papers to improve quality and convergence speed:
Audio samples on the validation set are available for Tacotron-2, FastSpeech, MelGAN, MelGAN-STFT, FastSpeech2, and Multi-band MelGAN.
Prepare a dataset in the following format:
|- [NAME_DATASET]/
|  |- metadata.csv
|  |- wavs/
|     |- file1.wav
|     |- ...
where metadata.csv has the following format: id|transcription. This is an ljspeech-like format; if your dataset is in a different format, you can skip the preprocessing step.
Note that NAME_DATASET should be one of [ljspeech/kss/baker/libritts/synpaflex].
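A metadata.csv in the id|transcription format above can be read with a few lines of Python. The helper below is purely illustrative (it is not part of the library), and it tolerates extra |-separated columns by keeping everything after the first | as the transcription:

```python
def parse_metadata(lines):
    """Parse ljspeech-like metadata lines of the form id|transcription."""
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        utt_id, transcription = line.split("|", 1)
        entries.append((utt_id, transcription))
    return entries

# typical usage:
# entries = parse_metadata(open("ljspeech/metadata.csv", encoding="utf-8"))
```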
Preprocessing has two steps:
To reproduce the steps above:
tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/libritts/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
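For example, with the bracketed placeholders filled in for LJSpeech, the two commands above become:

```shell
tensorflow-tts-preprocess --rootdir ./ljspeech --outdir ./dump_ljspeech \
  --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
tensorflow-tts-normalize --rootdir ./dump_ljspeech --outdir ./dump_ljspeech \
  --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
```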
Right now we only support ljspeech, kss, baker, libritts, thorsten, and synpaflex for the dataset argument. In the future, we intend to support more datasets.
Note: To run libritts preprocessing, please first read the instructions in examples/fastspeech2_libritts. We need to reformat it before preprocessing.
Note: To run synpaflex preprocessing, please first run the notebook notebooks/prepare_synpaflex.ipynb. We need to reformat it before preprocessing.
After preprocessing, the structure of the project folder should be:
|- [NAME_DATASET]/
|  |- metadata.csv
|  |- wav/
|     |- file1.wav
|     |- ...
|- dump_[ljspeech/kss/baker/libritts/thorsten]/
|  |- train/
|  |  |- ids/
|  |  |  |- LJ001-0001-ids.npy
|  |  |  |- ...
|  |  |- raw-feats/
|  |  |  |- LJ001-0001-raw-feats.npy
|  |  |  |- ...
|  |  |- raw-f0/
|  |  |  |- LJ001-0001-raw-f0.npy
|  |  |  |- ...
|  |  |- raw-energies/
|  |  |  |- LJ001-0001-raw-energy.npy
|  |  |  |- ...
|  |  |- norm-feats/
|  |  |  |- LJ001-0001-norm-feats.npy
|  |  |  |- ...
|  |  |- wavs/
|  |  |  |- LJ001-0001-wave.npy
|  |  |  |- ...
|  |- valid/
|  |  |- ids/
|  |  |  |- LJ001-0009-ids.npy
|  |  |  |- ...
|  |  |- raw-feats/
|  |  |  |- LJ001-0009-raw-feats.npy
|  |  |  |- ...
|  |  |- raw-f0/
|  |  |  |- LJ001-0009-raw-f0.npy
|  |  |  |- ...
|  |  |- raw-energies/
|  |  |  |- LJ001-0009-raw-energy.npy
|  |  |  |- ...
|  |  |- norm-feats/
|  |  |  |- LJ001-0009-norm-feats.npy
|  |  |  |- ...
|  |  |- wavs/
|  |  |  |- LJ001-0009-wave.npy
|  |  |  |- ...
|  |- stats.npy
|  |- stats_f0.npy
|  |- stats_energy.npy
|  |- train_utt_ids.npy
|  |- valid_utt_ids.npy
|- examples/
|  |- melgan/
|  |- fastspeech/
|  |- tacotron2/
|  ...
stats.npy contains the mean and std of mel-spectrograms in the training split.
stats_energy.npy contains the mean and std of energy values in the training split.
stats_f0.npy contains the mean and std of F0 values in the training split.
train_utt_ids.npy / valid_utt_ids.npy contain the training and validation utterance ids, respectively.
We use a suffix for each input type (ids, raw-feats, raw-energy, raw-f0, norm-feats, and wave).
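As a sketch of what the normalization step does, the norm-* features are just the raw features standardized with the training-split mean and std; assuming stats.npy stores a [mean, std] pair (the exact layout may differ), the relationship looks like this with stand-in values in place of np.load calls:

```python
import numpy as np

# stand-ins for np.load("dump_ljspeech/stats.npy") (assumed [mean, std] layout)
mean = np.array([-5.0])
std = np.array([2.0])

# stand-in for one *-raw-feats.npy frame sequence (3 frames, 1 mel bin)
raw_feats = np.array([[-3.0], [-5.0], [-7.0]])

# what tensorflow-tts-normalize computes for the norm-feats/ files
norm_feats = (raw_feats - mean) / std

# the transform is invertible, so raw features can be recovered
recovered = norm_feats * std + mean
```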
IMPORTANT NOTES:
The final structure of the dump folder SHOULD follow the structure above to be able to use the training scripts, or you can modify it yourself :) To learn how to train a model from scratch or fine-tune with other datasets/languages, please see the details in the examples directory.
See the detailed implementation of the abstract dataset class in tensorflow_tts/dataset/abstract_dataset. There are some functions you need to override and understand:
IMPORTANT NOTES:
Some examples that use this abstract_dataset are tacotron_dataset.py, fastspeech_dataset.py, melgan_dataset.py, and fastspeech2_dataset.py.
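The pattern those dataset files follow can be sketched as a subclass that overrides the abstract generator/length hooks. The stand-in base class and method names below mirror the typical shape of such a class but are illustrative, not the exact tensorflow_tts API:

```python
import abc

class AbstractDataset(abc.ABC):
    """Stand-in for the abstract dataset class (illustrative only)."""

    @abc.abstractmethod
    def get_args(self):
        """Return the arguments passed to the generator."""

    @abc.abstractmethod
    def generator(self, utt_ids):
        """Yield one training example per utterance id."""

    @abc.abstractmethod
    def get_len_dataset(self):
        """Return the number of examples."""

class CharMelDataset(AbstractDataset):
    """Toy dataset yielding (utt_id, ids, mel) tuples from in-memory data."""

    def __init__(self, items):
        self.items = items  # {utt_id: (ids, mel)}

    def get_args(self):
        return [sorted(self.items)]

    def generator(self, utt_ids):
        for utt_id in utt_ids:
            ids, mel = self.items[utt_id]
            yield utt_id, ids, mel

    def get_len_dataset(self):
        return len(self.items)
```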
See the detailed implementation of the base trainer in tensorflow_tts/trainer/base_trainer.py. It includes Seq2SeqBasedTrainer and GanBasedTrainer, which inherit from BasedTrainer. All trainers support single/multi GPU. When implementing a new trainer, you must override some functions:
All models in this repository are trained based on GanBasedTrainer (see train_melgan.py, train_melgan_stft.py, train_multiband_melgan.py) and Seq2SeqBasedTrainer (see train_tacotron2.py, train_fastspeech.py).
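The override pattern for a new trainer can be sketched as follows. The base class here is a minimal stand-in (the real BasedTrainer has more hooks such as compile, loss, and metric functions); the method names are illustrative:

```python
class BasedTrainer:
    """Minimal stand-in for the base trainer (illustrative only)."""

    def __init__(self):
        self.steps = 0

    def run(self, batches):
        """Drive the training loop, delegating per-batch work to subclasses."""
        for batch in batches:
            self._train_step(batch)
            self.steps += 1

    def _train_step(self, batch):
        raise NotImplementedError("subclasses must override the train step")

class MyTrainer(BasedTrainer):
    """A new trainer overrides the per-batch training step."""

    def __init__(self):
        super().__init__()
        self.losses = []

    def _train_step(self, batch):
        # pretend the "loss" is just the sum of the batch values
        self.losses.append(sum(batch))
```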
You can find how to run inference for each model in the notebooks, or check the Colab notebooks (for English, Korean, Chinese, French, and German). Below is example code for end-to-end inference using FastSpeech2 and Multi-band MelGAN. We uploaded all pretrained models to the Hugging Face Hub.
import numpy as np
import soundfile as sf
import yaml
import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech2 model.
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

# initialize mb_melgan model
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

# inference
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

input_ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")

# fastspeech inference
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# melgan inference
audio_before = mb_melgan.inference(mel_before)[0, :, 0]
audio_after = mb_melgan.inference(mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

All models here are licensed under the Apache 2.0 License.
We would like to thank Tomoki Hayashi, who discussed with us at length about MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron. This framework is based on his great open-source ParallelWaveGAN project.