TensorFlowTTS

Real-time state-of-the-art speech synthesis for TensorFlow 2
TensorFlowTTS provides real-time state-of-the-art speech synthesis architectures such as Tacotron-2, MelGAN, Multi-band MelGAN, FastSpeech, and FastSpeech2, based on TensorFlow 2. With TensorFlow 2, we can speed up training/inference, optimize further with fake-quantization-aware training and pruning, and make TTS models run faster than real-time and deployable on mobile devices or embedded systems.
This repository is tested on Ubuntu 18.04.
Different TensorFlow versions should work but are not tested. This repository tries to use the latest stable TensorFlow. We recommend installing TensorFlow 2.6.0 if you want to train with multiple GPUs.
$ pip install TensorFlowTTS
Examples are included in the repository but are not shipped with the framework. Therefore, to run the latest version of the examples, you need to install from source as below.
$ git clone https://github.com/TensorSpeech/TensorFlowTTS.git
$ cd TensorFlowTTS
$ pip install .
If you want to upgrade the repository and its dependencies:
$ git pull
$ pip install --upgrade .
TensorFlowTTS currently provides the following architectures:
We are also implementing some techniques from the following papers to improve the quality and convergence speed of the models.
Audio samples on the validation set: Tacotron-2, FastSpeech, MelGAN, MelGAN.STFT, FastSpeech2, Multiband-MelGAN.
Prepare a dataset in the following format:
|- [NAME_DATASET]/
| |- metadata.csv
| |- wavs/
| |- file1.wav
| |- ...
where metadata.csv has the following format: id|transcription. This is an ljspeech-like format; if you have a dataset in another format, you can skip the preprocessing step.
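For illustration, here is a minimal sketch of parsing this id|transcription format in Python (the filenames and transcriptions are made up for the example; the repository's own processors handle this for supported datasets):

```python
# Parse an ljspeech-style metadata.csv where each line is "id|transcription".
def parse_metadata(lines):
    items = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split only on the first "|" so transcriptions may contain "|".
        utt_id, transcription = line.split("|", 1)
        items.append((utt_id, transcription))
    return items

# Hypothetical sample rows in the id|transcription format.
sample = [
    "file1|Hello world.",
    "file2|This is a second utterance.",
]
pairs = parse_metadata(sample)
print(pairs[0])  # ('file1', 'Hello world.')
```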
Note that NAME_DATASET should be one of [ljspeech/kss/baker/libritts/thorsten/synpaflex].
Preprocessing has two steps: feature extraction (tensorflow-tts-preprocess) and normalization (tensorflow-tts-normalize).
To reproduce these steps:
tensorflow-tts-preprocess --rootdir ./[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
tensorflow-tts-normalize --rootdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --outdir ./dump_[ljspeech/kss/baker/libritts/thorsten/synpaflex] --config preprocess/[ljspeech/kss/baker/libritts/thorsten/synpaflex]_preprocess.yaml --dataset [ljspeech/kss/baker/libritts/thorsten/synpaflex]
Right now we only support ljspeech, kss, baker, libritts, thorsten, and synpaflex for the dataset argument. We intend to support more datasets in the future.
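For a concrete example, with the LJSpeech dataset the two bracketed commands above expand to the following (the ./ljspeech and ./dump_ljspeech paths are illustrative; point --rootdir at wherever your dataset actually lives):

```shell
tensorflow-tts-preprocess --rootdir ./ljspeech --outdir ./dump_ljspeech \
  --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
tensorflow-tts-normalize --rootdir ./dump_ljspeech --outdir ./dump_ljspeech \
  --config preprocess/ljspeech_preprocess.yaml --dataset ljspeech
```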
Note: To run libritts preprocessing, please first read the instructions in examples/fastspeech2_libritts. The data needs to be reformatted before preprocessing.
Note: To run synpaflex preprocessing, please first run notebooks/prepare_synpaflex.ipynb. The data needs to be reformatted before preprocessing.
After preprocessing, the structure of the project folder should be:
|- [NAME_DATASET]/
| |- metadata.csv
| |- wavs/
| |- file1.wav
| |- ...
|- dump_[ljspeech/kss/baker/libritts/thorsten]/
| |- train/
| |- ids/
| |- LJ001-0001-ids.npy
| |- ...
| |- raw-feats/
| |- LJ001-0001-raw-feats.npy
| |- ...
| |- raw-f0/
| |- LJ001-0001-raw-f0.npy
| |- ...
| |- raw-energies/
| |- LJ001-0001-raw-energy.npy
| |- ...
| |- norm-feats/
| |- LJ001-0001-norm-feats.npy
| |- ...
| |- wavs/
| |- LJ001-0001-wave.npy
| |- ...
| |- valid/
| |- ids/
| |- LJ001-0009-ids.npy
| |- ...
| |- raw-feats/
| |- LJ001-0009-raw-feats.npy
| |- ...
| |- raw-f0/
| |- LJ001-0001-raw-f0.npy
| |- ...
| |- raw-energies/
| |- LJ001-0001-raw-energy.npy
| |- ...
| |- norm-feats/
| |- LJ001-0009-norm-feats.npy
| |- ...
| |- wavs/
| |- LJ001-0009-wave.npy
| |- ...
| |- stats.npy
| |- stats_f0.npy
| |- stats_energy.npy
| |- train_utt_ids.npy
| |- valid_utt_ids.npy
|- examples/
| |- melgan/
| |- fastspeech/
| |- tacotron2/
| ...
- stats.npy contains the mean and std of the mel spectrograms in the training split.
- stats_energy.npy contains the mean and std of the energy values in the training split.
- stats_f0.npy contains the mean and std of the F0 values in the training split.
- train_utt_ids.npy / valid_utt_ids.npy contain the training and validation utterance ids, respectively.
We use a suffix for each input type (ids, raw-feats, raw-energy, raw-f0, norm-feats, and wave).
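As a sketch of how mean/std stats files like these are typically applied (a plain z-score normalization; the layout of the stats array below is an assumption for illustration, and the repository's own loading code may differ):

```python
import numpy as np

# Assumed layout for the example: row 0 holds per-dimension means,
# row 1 holds per-dimension stds. We fabricate a tiny 2-dim example
# instead of loading a real stats.npy file.
stats = np.array([[0.0, -1.0],   # mean per mel dimension
                  [1.0,  2.0]])  # std per mel dimension
mean, std = stats[0], stats[1]

raw_feats = np.array([[1.0, 3.0],
                      [2.0, -1.0]])        # shape: (frames, mel_dims)
norm_feats = (raw_feats - mean) / std      # z-score normalization

print(norm_feats)
```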
Important notes:
The final structure of the dump folder should follow the structure above in order to use the training scripts, or you can modify them yourself. To learn how to train a model from scratch or fine-tune on other datasets/languages, please see the details in the examples directory.
See the detailed implementation of the abstract dataset class in tensorflow_tts/datasets/abstract_dataset. There are some functions you need to override and understand:
Important notes:
Some examples that use this abstract_dataset are tacotron_dataset.py, fastspeech_dataset.py, melgan_dataset.py, and fastspeech2_dataset.py.
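The pattern these dataset classes follow can be sketched in plain Python (a simplified stand-in with illustrative names only; the real classes inherit from the repository's abstract dataset, build tf.data pipelines from the dumped .npy files, and have different signatures):

```python
class CharMelDatasetSketch:
    """Simplified sketch: pairs utterance ids with precomputed features.

    Names and methods here are illustrative; see the repository's
    abstract_dataset for the actual functions you must override.
    """

    def __init__(self, utt_ids, ids_by_utt, feats_by_utt):
        self.utt_ids = utt_ids
        self.ids_by_utt = ids_by_utt      # utt_id -> token id array
        self.feats_by_utt = feats_by_utt  # utt_id -> mel feature array

    def __len__(self):
        # Length of the dataset = number of utterances.
        return len(self.utt_ids)

    def generator(self):
        # Yield one training example per utterance id; a real
        # implementation would feed this into tf.data.Dataset.from_generator.
        for utt_id in self.utt_ids:
            yield {
                "utt_id": utt_id,
                "input_ids": self.ids_by_utt[utt_id],
                "mel": self.feats_by_utt[utt_id],
            }

ds = CharMelDatasetSketch(
    ["LJ001-0001"],
    {"LJ001-0001": [1, 2, 3]},
    {"LJ001-0001": [[0.1, 0.2]]},
)
examples = list(ds.generator())
print(len(ds), examples[0]["utt_id"])
```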
See the detailed implementation of base_trainer in tensorflow_tts/trainer/base_trainer.py. It includes Seq2SeqBasedTrainer and GanBasedTrainer, which inherit from BasedTrainer. All trainers support single/multi GPU. When implementing a new trainer, you must override some functions:
All models in this repository are trained with GanBasedTrainer (see train_melgan.py, train_melgan_stft.py, train_multiband_melgan.py) and Seq2SeqBasedTrainer (see train_tacotron2.py, train_fastspeech.py).
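The override pattern can be sketched as follows (illustrative names only; the real base_trainer has different hooks and additionally handles distribution strategies, checkpointing, and logging):

```python
class BaseTrainerSketch:
    """Minimal stand-in for a training-loop skeleton; not the real API."""

    def __init__(self):
        self.steps = 0
        self.losses = []

    def _train_step(self, batch):
        # Subclasses must override this with their actual training logic.
        raise NotImplementedError

    def fit(self, batches):
        # The base class owns the loop; subclasses only define the step.
        for batch in batches:
            self.losses.append(self._train_step(batch))
            self.steps += 1


class ToyTrainer(BaseTrainerSketch):
    def _train_step(self, batch):
        # A real trainer would compute gradients and update weights here;
        # we just return the batch mean as a stand-in "loss".
        return sum(batch) / len(batch)


trainer = ToyTrainer()
trainer.fit([[1.0, 3.0], [2.0, 2.0]])
print(trainer.steps, trainer.losses)  # 2 [2.0, 2.0]
```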
You can find how to run inference for each model in the notebooks, or check the Colab notebooks (for English, Korean, Chinese, French, and German). Below is example code for end-to-end inference with FastSpeech2 and Multi-band MelGAN. We have uploaded all pretrained models to the Hugging Face Hub.
import numpy as np
import soundfile as sf
import yaml
import tensorflow as tf

from tensorflow_tts.inference import TFAutoModel
from tensorflow_tts.inference import AutoProcessor

# initialize fastspeech2 model.
fastspeech2 = TFAutoModel.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

# initialize mb_melgan model
mb_melgan = TFAutoModel.from_pretrained("tensorspeech/tts-mb_melgan-ljspeech-en")

# inference
processor = AutoProcessor.from_pretrained("tensorspeech/tts-fastspeech2-ljspeech-en")

input_ids = processor.text_to_sequence("Recent research at Harvard has shown meditating for as little as 8 weeks, can actually increase the grey matter in the parts of the brain responsible for emotional regulation, and learning.")

# fastspeech inference
mel_before, mel_after, duration_outputs, _, _ = fastspeech2.inference(
    input_ids=tf.expand_dims(tf.convert_to_tensor(input_ids, dtype=tf.int32), 0),
    speaker_ids=tf.convert_to_tensor([0], dtype=tf.int32),
    speed_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    f0_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
    energy_ratios=tf.convert_to_tensor([1.0], dtype=tf.float32),
)

# melgan inference
audio_before = mb_melgan.inference(mel_before)[0, :, 0]
audio_after = mb_melgan.inference(mel_after)[0, :, 0]

# save to file
sf.write('./audio_before.wav', audio_before, 22050, "PCM_16")
sf.write('./audio_after.wav', audio_after, 22050, "PCM_16")

All models here are licensed under the Apache 2.0 License.
We would like to thank Tomoki Hayashi, who discussed with us at length about MelGAN, Multi-band MelGAN, FastSpeech, and Tacotron. This framework is based on his great open-source ParallelWaveGAN project.