An Unofficial Implementation of WavThruVec Based on PyTorch.
The original paper is WavThruVec: Latent speech representation as intermediate features for neural speech synthesis
The Text2Vec model mostly follows the FastSpeech architecture (xcmyz's implementation). I modified the model mainly based on NVIDIA's RAD-TTS, and added an ECAPA-TDNN as the speaker encoder for multi-speaker conditioning.
For details not mentioned in the paper, I also follow RAD-TTS.
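As a rough illustration of how the speaker embedding produced by the ECAPA-TDNN can condition Text2Vec, here is a minimal sketch that simply projects the embedding and adds it to every encoder time step; the fusion strategy, module name, and dimensions are assumptions, not necessarily what this repo does.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Broadcast-add a projected speaker embedding to the encoder outputs.
    Dimensions (192-d ECAPA embedding, 256-d encoder) are illustrative."""

    def __init__(self, spk_dim=192, enc_dim=256):
        super().__init__()
        self.proj = nn.Linear(spk_dim, enc_dim)

    def forward(self, enc_out, spk_emb):
        # enc_out: (batch, n_token, enc_dim), spk_emb: (batch, spk_dim) from the speaker encoder
        return enc_out + self.proj(spk_emb).unsqueeze(1)
```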
Vec2Wav is mostly based on HiFi-GAN, with Conditional Batch Normalization introduced to condition the network on the speaker embedding. The upsample-rate sequence is (5, 4, 4, 2, 2), so the total upsampling factor is 5 × 4 × 4 × 2 × 2 = 320, matching the 320-sample hop of wav2vec 2.0 features at 16 kHz.
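Below is a minimal sketch of a Conditional Batch Normalization layer for this kind of speaker conditioning: the per-channel scale and shift are predicted from the speaker embedding instead of being learned as fixed affine parameters. Names and sizes are illustrative, not the exact ones in this repo.

```python
import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """BatchNorm whose affine parameters are predicted from a speaker embedding."""

    def __init__(self, num_features, spk_dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(num_features, affine=False)  # normalization only
        self.to_gamma = nn.Linear(spk_dim, num_features)       # speaker-dependent scale
        self.to_beta = nn.Linear(spk_dim, num_features)        # speaker-dependent shift

    def forward(self, x, spk_emb):
        # x: (batch, channels, time), spk_emb: (batch, spk_dim)
        gamma = self.to_gamma(spk_emb).unsqueeze(-1)  # (batch, channels, 1)
        beta = self.to_beta(spk_emb).unsqueeze(-1)
        return gamma * self.bn(x) + beta
```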



No rule-based text normalization or phonemization is used; raw characters are fed in and transformed into text embeddings as the inputs.
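A minimal sketch of this input pipeline, with a hypothetical character vocabulary (the real symbol set is built from the training corpus):

```python
import torch
import torch.nn as nn

# hypothetical symbol table; index 0 is padding, index 1 is unknown
symbols = ["<pad>", "<unk>"] + list("abcdefghijklmnopqrstuvwxyz ,.!?")
char_to_id = {c: i for i, c in enumerate(symbols)}

embedding = nn.Embedding(len(symbols), embedding_dim=256, padding_idx=0)

text = "hello world"
ids = torch.LongTensor([[char_to_id.get(c, 1) for c in text]])  # (1, n_token)
text_emb = embedding(ids)  # (1, n_token, 256), fed to the Text2Vec encoder
```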
wav2vec 2.0's output is used as the waveform feature (instead of a mel spectrogram), with dtype float32 and shape (batch_size, n_frame, n_channel).
Note: n_channel is 768 or 1024 depending on which wav2vec 2.0 pretrained model you use, since TencentGameMate provides a fairseq version (768) and a huggingface version (1024); the two versions have different output shapes.
The pretrained model comes from the repository wav2vec 2.0 (chinese speech pretrain); it can also be found on Hugging Face.
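A minimal sketch of extracting these features with the Hugging Face transformers API; the checkpoint id and resampling step are assumptions, so adapt them to the pretrained model and sample rate you actually use.

```python
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

model_name = "TencentGameMate/chinese-wav2vec2-base"  # assumed checkpoint id
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2Model.from_pretrained(model_name).eval()

wav, sr = torchaudio.load("example.wav")
wav = torchaudio.functional.resample(wav, sr, 16000).squeeze(0)  # wav2vec 2.0 expects 16 kHz

with torch.no_grad():
    inputs = feature_extractor(wav.numpy(), sampling_rate=16000, return_tensors="pt")
    feats = model(inputs.input_values).last_hidden_state  # (batch, n_frame, n_channel), float32

print(feats.dtype, feats.shape)
```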
One of the biggest differences between WavThruVec and FastSpeech is the monotonic alignment search (MAS) module (see alignment.py).
In FastSpeech, the training inputs include a teacher-forced alignment between mel frames and text tokens; specifically, MFA is used before training to generate the duration (in mel frames) of each text token.
In WavThruVec, the durations are instead produced by the MAS from RAD-TTS and fed into the LengthRegulator (DurationPredictor).
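For reference, a minimal sketch of a FastSpeech-style LengthRegulator that expands the encoder outputs according to per-token durations (MAS durations at training time, predicted durations at inference); the class and tensor shapes here are illustrative.

```python
import torch
import torch.nn as nn

class LengthRegulator(nn.Module):
    """Repeat each token's hidden vector `duration` times along the time axis."""

    def forward(self, x, durations):
        # x: (batch, n_token, channels), durations: (batch, n_token) integer frame counts
        expanded = [torch.repeat_interleave(h, d, dim=0) for h, d in zip(x, durations)]
        # pad to the longest expanded sequence in the batch
        return nn.utils.rnn.pad_sequence(expanded, batch_first=True)

# example: durations from MAS (training) or the duration predictor (inference)
x = torch.randn(2, 5, 256)
dur = torch.tensor([[1, 2, 3, 1, 2], [2, 2, 2, 2, 2]])
frames = LengthRegulator()(x, dur)  # (2, 10, 256)
```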
Following the monotonic alignment search and the RAD-TTS implementation, alignment-prior files are generated under the './data/align_prior' directory during training, with file names of the form {n_token}_{n_feat}_prior.pth.
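The alignment prior used by RAD-TTS-style MAS is a beta-binomial distribution over (n_token, n_feat); below is a minimal sketch of how such a prior can be computed and cached with the file naming above. The exact scaling parameter and directory handling are assumptions.

```python
import os
import numpy as np
import torch
from scipy.stats import betabinom

def beta_binomial_prior(n_token, n_feat, scaling=1.0):
    # prior[t, k]: probability that feature frame t aligns to text token k
    prior = np.zeros((n_feat, n_token))
    for t in range(1, n_feat + 1):
        a, b = scaling * t, scaling * (n_feat + 1 - t)
        prior[t - 1] = betabinom(n_token - 1, a, b).pmf(np.arange(n_token))
    return torch.from_numpy(prior).float()

def get_or_build_prior(n_token, n_feat, cache_dir="./data/align_prior"):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"{n_token}_{n_feat}_prior.pth")
    if os.path.exists(path):
        return torch.load(path)
    prior = beta_binomial_prior(n_token, n_feat)
    torch.save(prior, path)
    return prior
```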
AISHELL-3
The prepare_data.py script:
As an example, prepare_data.py takes only a few speakers and a few wav files.
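As a hypothetical illustration of that subsetting (the AISHELL-3 directory layout and limits below are assumptions, not the script's actual values):

```python
from pathlib import Path

DATA_ROOT = Path("./AISHELL-3/train/wav")  # assumed dataset layout
MAX_SPEAKERS = 5
MAX_WAVS_PER_SPEAKER = 20

speakers = sorted(p for p in DATA_ROOT.iterdir() if p.is_dir())[:MAX_SPEAKERS]
for spk_dir in speakers:
    wavs = sorted(spk_dir.glob("*.wav"))[:MAX_WAVS_PER_SPEAKER]
    print(spk_dir.name, len(wavs))
```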
WavThruVec contains two components, Text2Vec (the encoder) and Vec2Wav (the decoder), which are trained independently.
Thus, I placed them in two separate directories and used different training configurations for each.
The TensorBoard logs are stored in the run/{log_seed}/tb_logs directory.
Suppose log_seed=1; you can use this command to serve TensorBoard on your localhost.
tensorboard --logdir run/1/tb_logs
The model checkpoints are saved in the run/{log_seed}/model_new directory.
Suppose you save checkpoints every 10000 iterations and now have a checkpoint checkpoint_10000.pth.tar.
To resume training from step 10000, use this command.
python ./text2vec/train.py --restore_step 10000