Introduction
- The purpose of this project is to extend the capabilities of Bert-VITS2, for example having TTS generate facial expression data synchronously with the audio.
- Results
- Demo Bilibili Youtube
- TTS with Audio2PhotoReal --> MetaHuman
- Facial expression generation test from singing; comparison with expressions generated from speech by Azure TTS
- First version of TTS-generated facial expressions, combined with MotionGPT
Extended to CosyVoice
Extended to GPT-SoVITS
- GPT-SoVITS Expression Test
- Retraining directly on GPT-SoVITS gives relatively poor results in practice
- As a temporary workaround, the posterior encoder and expression-generation models from Bert-VITS2-ext are copied directly into GPT-SoVITS for testing
- This results in redundant computation and additional prediction steps
TTS
The TTS code originates from the Bert-VITS2 v2.3 Final Release [Dec 20, 2023]
- https://github.com/fishaudio/Bert-VITS2/commit/7ebc1aa28a055608f7e31da93928cf295fdffeba
After testing, I personally feel that with purely Chinese training material, versions 2.1-2.3 train slightly worse than 1.0; consider falling back to an older version or mixing in pure Chinese.
For training the TTS itself, see the original project's documentation (it differs between versions)
- Recommended version 1.0 https://github.com/YYuX-1145/Bert-VITS2-Integration-package
TTS with synchronized facial expression output
Ideas
- Referring to the network structure diagram in the VITS paper (not bert-vits2, but the overall structure is the same), take the hidden variable z obtained after text encoding and flow transformation, right before decoding, and output expression values (Live Link Face values) from a side branch
- Freeze the parameters of the original network, add a separate LSTM and MLP, and perform seq2seq generation mapping z to expressions (a minimal sketch follows this list)
- Of course, if high-quality expression data is available, the expressions could also be added to the original TTS network's training, which should improve audio quality as well
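Below is a minimal sketch of what such a bypass head could look like. The module name, layer sizes, and output dimension are assumptions for illustration; the actual architecture in this repo may differ.

```python
import torch
import torch.nn as nn

class VisemeHead(nn.Module):
    """Bypass head: maps the TTS hidden variable z to Live Link Face values.

    z_channels follows the TTS hidden_channels; out_dim is the number of
    expression channels streamed by Live Link Face (both assumed here).
    """
    def __init__(self, z_channels=192, hidden=256, out_dim=61):
        super().__init__()
        self.lstm = nn.LSTM(z_channels, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, out_dim),
            nn.Sigmoid(),  # assumes targets normalized to [0, 1]
        )

    def forward(self, z):
        # z: (batch, z_channels, frames) taken from the frozen TTS network
        x = z.transpose(1, 2)        # -> (batch, frames, z_channels)
        x, _ = self.lstm(x)
        return self.mlp(x)           # (batch, frames, out_dim)
```

The original TTS weights stay frozen; only this head would be optimized, e.g. with an L1 or MSE loss against the recorded Live Link values.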

Data collection
- In Live Link Face, set the target to this machine's IP; the port defaults to 11111
- Record the voice and the corresponding expression values output by Live Link simultaneously, and store both in the records folder
- Each run of the collection script records 20 seconds
- Audio defaults to 44100 Hz
- The collected audio and expression streams may be offset from each other
- For the same data source, compare validation-set loss at different offsets to find the optimal alignment (a rough cross-correlation sketch follows the commands below)
python ./motion/record.py
python ./motion/tts2ue.py --bs_npy_file ./records/2023-12-23-17-19-54.npy --wav_file ./records/2023-12-23-17-19-54.wav --fps 60
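The repo picks the offset by comparing validation loss. As a simpler, purely illustrative alternative (not the project's actual method), the lag can be roughly estimated by cross-correlating the audio energy envelope with a mouth-related blendshape channel; the channel index below is a guess.

```python
import numpy as np
import soundfile as sf

def estimate_offset_ms(wav_file, bs_npy_file, fps=60, mouth_idx=17):
    """Rough audio/expression lag estimate via cross-correlation (illustrative only)."""
    audio, sr = sf.read(wav_file)                 # recorded at 44100 Hz
    if audio.ndim > 1:
        audio = audio.mean(axis=1)
    bs = np.load(bs_npy_file)                     # (frames, channels) Live Link values

    hop = int(sr / fps)                           # samples per expression frame
    n = min(len(bs), len(audio) // hop)
    energy = np.array([np.sqrt(np.mean(audio[i * hop:(i + 1) * hop] ** 2))
                       for i in range(n)])
    mouth = bs[:n, mouth_idx]

    # Normalize, cross-correlate, and take the peak as the relative lag in frames.
    energy = (energy - energy.mean()) / (energy.std() + 1e-8)
    mouth = (mouth - mouth.mean()) / (mouth.std() + 1e-8)
    corr = np.correlate(energy, mouth, mode="full")
    lag_frames = int(corr.argmax()) - (n - 1)     # sign convention should be verified
    return lag_frames * 1000.0 / fps

print(estimate_offset_ms("./records/2023-12-23-17-19-54.wav",
                         "./records/2023-12-23-17-19-54.npy"))
```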
Data preprocessing
- Read all audio files in records and use the posterior encoder to encode each into the hidden variable z, stored as *.z.npy (outlined below)
- Write the file lists used for training and validation
- fileslists/val_visemes.list
- fileslists/train_visemes.list
python ./motion/prepare_visemes.py
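A rough outline of what the preprocessing step does per file. `posterior_encode` is a placeholder for the Bert-VITS2 posterior encoder, and the file-list format is simplified; the real implementation lives in ./motion/prepare_visemes.py.

```python
import glob
import numpy as np
import librosa

HOP_LENGTH = 512  # 44100 / 512 = 86.1328125 z-frames per second

def extract_z(wav_path, posterior_encode):
    """posterior_encode: placeholder taking 44100 Hz mono audio and returning
    the hidden variable z with shape (hidden_channels, n_frames)."""
    audio, _ = librosa.load(wav_path, sr=44100, mono=True)
    z = posterior_encode(audio)
    np.save(wav_path + ".z.npy", z)

# Simple train/validation split over the recorded sessions (the split ratio and
# the exact fields per line in the real list files may differ).
wavs = sorted(glob.glob("./records/*.wav"))
val, train = wavs[:1], wavs[1:]
with open("fileslists/train_visemes.list", "w") as f:
    f.write("\n".join(train))
with open("fileslists/val_visemes.list", "w") as f:
    f.write("\n".join(val))
```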
Training
- Add --visemes to the train_ms.py command line to distinguish this from training the main network (a sketch of the flag handling follows the command below)
python train_ms.py -m OUTPUT_MODEL --config ./configs/config.json --visemes
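A sketch of how the flag might gate the training setup. Only the --visemes argument name comes from the command above; `net_g` (the pretrained generator), `viseme_head`, and the learning rates are illustrative placeholders.

```python
import argparse
import torch

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--model", type=str, default="OUTPUT_MODEL")
    parser.add_argument("-c", "--config", type=str, default="./configs/config.json")
    parser.add_argument("--visemes", action="store_true",
                        help="train the z-to-expression head instead of the main TTS network")
    return parser.parse_args()

def build_optimizer(args, net_g, viseme_head):
    """net_g: pretrained Bert-VITS2 generator; viseme_head: the bypass LSTM/MLP head."""
    if args.visemes:
        for p in net_g.parameters():
            p.requires_grad = False              # keep the TTS backbone frozen
        return torch.optim.AdamW(viseme_head.parameters(), lr=1e-4)
    return torch.optim.AdamW(net_g.parameters(), lr=2e-4)
```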
Inference
- When webui.py is run, the output audio, hidden variables, and animation data are written to the current directory; use tts2ue.py to preview the generated result
- The default fps of the generated animation is the same as that of the hidden variable: 86.1328125
- 44100 / 86.1328125 = 512, an exact integer; this is determined by Bert-VITS2's audio sampling rate, network structure, and hidden_channels (a quick arithmetic check follows the command below)
python ./motion/tts2ue.py --bs_npy_file ./tmp.npy --wav_file ./tmp.wav --delay_ms 700
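The fps figure is just the sampling rate divided by the effective hop size, and the --delay_ms value can be thought of in terms of z-frames (the conversion below is an illustration, not necessarily how tts2ue.py applies it):

```python
SR = 44100             # Bert-VITS2 sampling rate
HOP = 512              # effective samples per z-frame (see the note above)
fps = SR / HOP         # 86.1328125 z-frames per second

delay_ms = 700
delay_frames = round(delay_ms / 1000 * fps)   # about 60 frames for 700 ms
print(fps, delay_frames)
```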
Voice to expression
- Use the posterior encoder to convert the audio into z, then convert z into expressions (a conceptual sketch follows the commands below)
- The audio needs to be converted into a 44100 Hz wav file with only one channel retained (ffmpeg)
# Trim and convert the audio
ffmpeg -i input_file -ss 00:00:00 -t 00:00:10 -ar 44100 -f wav test.wav
# Keep channel 1 only
ffmpeg -i test.wav -map_channel 0.0.0 output.wav
python ./motion/wav_to_visemes.py output.wav
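Conceptually, wav_to_visemes.py chains the two pieces sketched earlier: posterior encoder, then the expression head. The names below are placeholders for whatever the script actually loads.

```python
import numpy as np
import librosa
import torch

def wav_to_visemes(wav_path, posterior_encode, viseme_head):
    """posterior_encode / viseme_head: placeholders for the loaded Bert-VITS2
    posterior encoder and the trained expression head (see the sketch above)."""
    audio, _ = librosa.load(wav_path, sr=44100, mono=True)
    with torch.no_grad():
        z = posterior_encode(audio)                      # (hidden_channels, frames)
        bs = viseme_head(torch.as_tensor(z, dtype=torch.float32).unsqueeze(0))
    out_path = wav_path + ".npy"
    np.save(out_path, bs.squeeze(0).numpy())             # one row of values per frame
    return out_path
```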
Body animation
MotionGPT
- With voice and expressions in place, an LLM can additionally generate a matching action description, and a text-to-motion model can then produce body animations that match the speech content, even interacting with the scene and other characters
- The project used for text to motion testing is MotionGPT
- No animation post-processing has been done yet; the motion in-betweening is handled by the imported model
- The flan-t5-base used by MotionGPT cannot understand Chinese, so it cannot generate animations closely synchronized with the spoken text (the word order changes somewhat after translation into English)
- Whether the spoken text or the hidden variable z can be used directly to guide motion generation has not yet been explored
- MotionGPT outputs bone positions, which cannot be fed directly into UE's skeletal animation
- Currently a simplified method is used to estimate the motion values, which incurs considerable positional loss
- Compute each bone's rotation relative to its parent node (as a quaternion)
- Send it to the VMCtoMOP program via the OSC protocol, which previews the animation and performs the protocol conversion (see the sketch after this list)
- Map MOP data to MetaHuman with the help of the Mop plugin
- Tested with UE 5.3
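A simplified illustration of the two steps above: estimate each bone's rotation relative to its parent from positions via a shortest-arc alignment (which is exactly where the positional loss comes from, since bone roll and lengths are discarded), then stream it over OSC with python-osc. The VMC-style address, port, and bone name are assumptions about what VMCtoMOP expects.

```python
import numpy as np
from pythonosc.udp_client import SimpleUDPClient

def shortest_arc_quat(v_from, v_to):
    """Quaternion (x, y, z, w) rotating v_from onto v_to; degenerate if the
    vectors are exactly opposite."""
    a = v_from / (np.linalg.norm(v_from) + 1e-9)
    b = v_to / (np.linalg.norm(v_to) + 1e-9)
    xyz = np.cross(a, b)
    w = 1.0 + float(np.dot(a, b))
    q = np.array([xyz[0], xyz[1], xyz[2], w])
    return q / (np.linalg.norm(q) + 1e-9)

# Hypothetical single frame: parent/child joint positions from MotionGPT output,
# plus the bone's rest-pose direction taken from the skeleton's bind pose.
client = SimpleUDPClient("127.0.0.1", 39539)      # port assumed for VMCtoMOP
rest_dir = np.array([0.0, -1.0, 0.0])
parent_pos = np.array([0.2, 1.4, 0.0])
child_pos = np.array([0.25, 1.15, 0.05])

q = shortest_arc_quat(rest_dir, child_pos - parent_pos)
# VMC-style bone message: name, position (x, y, z), rotation quaternion (x, y, z, w)
client.send_message("/VMC/Ext/Bone/Pos",
                    ["RightUpperArm", *parent_pos.tolist(), *q.tolist()])
```

In practice this would run per bone per frame, paced at the animation frame rate.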
audio2photoreal
- Modified from the original project: animation data is exported to local files during web-interface inference
- Export process
- Code https://github.com/see2023/audio2photoreal
Bert-VITS2 Original Statement
VITS2 Backbone with multilingual bert
For a quick guide, please refer to webui_preprocess.py.
Please note that the core idea of this project comes from anyvoiceai/MassTTS, a very good TTS project
MassTTS's demo features an AI version of Feng Ge reviewing Feng Ge himself and recovering the kidney he lost in the Golden Triangle
Mature Travelers/Trailblazers/Captains/Doctors/sensei/Witchers/Meowloon/V should refer to the code and learn how to train on their own.
This project is strictly prohibited from being used for any purpose that violates the Constitution, the Criminal Law, the Public Security Administration Punishment Law, or the Civil Code of the People's Republic of China.
Use for any politically related purpose is strictly prohibited.
Video: https://www.bilibili.com/video/BV1hp4y1K78E
Demo: https://www.bilibili.com/video/BV1TF411k78w
References
- anyvoiceai/MassTTS
- jaywalnut310/vits
- p0p4k/vits2_pytorch
- svc-develop-team/so-vits-svc
- PaddlePaddle/PaddleSpeech
- emotional-vits
- fish-speech
- Bert-VITS2-UI
Thanks to all contributors for their efforts