| Task | Notebook |
|---|---|
| Whisper_Vits_Japanese (built-in Ella dataset) | |
This project uses OpenAI's Whisper as the data processor for VITS. By modifying Whisper's transcribe.py, it generates a matching SRT file for each audio file (this part is based on a PR that has since been deleted and can no longer be found, so the original author unfortunately cannot be credited). The limitation that Whisper only reads a few audio files at a time is also lifted, so it can traverse every audio file in a folder. Because Whisper can output SRT, long audio becomes usable as input: users no longer need to cut their audio into pieces or transcribe long recordings by hand. We rely on Whisper directly for speech recognition and data preparation, automatically slice the audio into short clips, automatically generate the transcript files, and then feed everything into the VITS training process. Since long dry (unaccompanied) vocal recordings are relatively easy to obtain, the barrier to entry for VITS is lowered considerably once again.
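The modified transcribe.py came from a PR that has since been deleted, so it cannot be shown here; the idea can be sketched with the standard whisper Python API (the model size, language and folder name below are placeholders, not necessarily what the notebook uses):

```python
import whisper
from pathlib import Path

def srt_timestamp(seconds):
    # seconds (float) -> "HH:MM:SS,mmm"
    ms = int(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02}:{m:02}:{s:02},{ms:03}"

model = whisper.load_model("medium")                 # model size is a placeholder
for wav in sorted(Path("audio").glob("*.wav")):      # traverse every audio file in the folder
    result = model.transcribe(str(wav), language="ja")
    with open(wav.with_suffix(".srt"), "w", encoding="utf-8") as f:
        for i, seg in enumerate(result["segments"], start=1):
            f.write(f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
                    f"{seg['text'].strip()}\n\n")
```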
The processing flow is roughly as follows. The SRT file recognized by Whisper is processed by auto.py; this step is adapted from tobiasrordorf/SRT-to-CSV-and-audio-split: Split long audio files based on subtitle-info in SRT File (Transcript saved in CSV) (github.com). The audio file is first converted to 22050 Hz, 16-bit, and then the timestamps and recognized transcript of the SRT file with the same name are converted into a CSV file. The CSV file records the start and end time of each audio segment, together with the corresponding transcript and audio file path. The AudioSegment package (pydub) is then used to split the long audio at those start and end times, producing audio files suffixed in slice order, such as A_0.wav, A_1.wav, and so on. All sliced audio is stored in the slice_audio folder, and the "path|text" txt file required by VITS is generated under the filelists folder. From there, the data flows directly into the VITS part.
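To make that concrete, here is a minimal sketch of the SRT → CSV → slice step. This is not the actual auto.py; the function name, CSV layout and filelist location are illustrative:

```python
import csv
import re
from pathlib import Path
from pydub import AudioSegment

TS = re.compile(r"(\d+):(\d+):(\d+)[,.](\d+) --> (\d+):(\d+):(\d+)[,.](\d+)")

def to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def slice_from_srt(wav_path, srt_path, out_dir="slice_audio", filelist="filelists/filelist.txt"):
    # Convert to 22050 Hz / 16-bit before slicing, as the pipeline expects
    audio = AudioSegment.from_file(wav_path).set_frame_rate(22050).set_sample_width(2)
    Path(out_dir).mkdir(exist_ok=True)
    Path(filelist).parent.mkdir(exist_ok=True)
    rows = []
    for i, block in enumerate(Path(srt_path).read_text(encoding="utf-8").strip().split("\n\n")):
        lines = block.splitlines()
        m = TS.search(lines[1])                # line 0 = subtitle index, line 1 = timestamps
        text = " ".join(lines[2:]).strip()     # remaining lines = recognized transcript
        start, end = to_ms(*m.groups()[:4]), to_ms(*m.groups()[4:])
        clip = f"{out_dir}/{Path(wav_path).stem}_{i}.wav"
        audio[start:end].export(clip, format="wav")
        rows.append((start, end, text, clip))
    # CSV with start/end times, transcript and clip path
    with open(f"{Path(wav_path).stem}.csv", "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(rows)
    # "path|text" filelist consumed by VITS
    with open(filelist, "a", encoding="utf-8") as f:
        f.writelines(f"{p}|{t}\n" for _, _, t, p in rows)
```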
The VITS cleaners and symbols I use come from CjangCjengh/vits: VITS implementation of Japanese, Chinese, Korean and Sanskrit (github.com), in its earliest form from when the project first appeared. That repository has since added more cleaners and symbols, but I am a nostalgic person and miss the time when everyone was first arriving at VITS, so I still use the original version. VITS has two main preprocessing steps, monotonic align and preprocess.py, after which train.py can be started. I put the whole pipeline into whisper-vits-japanese.ipynb, so it only needs to be run cell by cell. The only thing the user has to change is the path to my audio zip, which should be replaced with their own; nothing else needs modification. Finally, I added cells that save the model and processed files to a cloud drive and restore the most recent checkpoint from it at the start of the next training run.
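As a rough illustration of that save/restore idea, assuming the notebook runs on Colab with Google Drive as the cloud drive (the backup folder and log directory below are made-up names; the notebook's own cells are the authoritative version):

```python
import shutil
from pathlib import Path
from google.colab import drive

drive.mount("/content/drive")
backup = Path("/content/drive/MyDrive/vits_backup")          # placeholder backup folder
logs = Path("/content/whisper-vits-japanese/logs/ljs_base")  # placeholder log directory
backup.mkdir(parents=True, exist_ok=True)

# Save: copy the newest generator/discriminator checkpoints to Drive
for pattern in ("G_*.pth", "D_*.pth"):
    ckpts = sorted(logs.glob(pattern), key=lambda p: int(p.stem.split("_")[1]))
    if ckpts:
        shutil.copy(ckpts[-1], backup / ckpts[-1].name)

# Restore: copy them back before the next run, so training resumes from the latest checkpoint
for ckpt in backup.glob("*.pth"):
    shutil.copy(ckpt, logs / ckpt.name)
```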
For multi-speaker training, just name each audio file speakerId_XXXX.wav and upload it to the audio folder, then follow the same general steps. Once the audio processing is done, run auto_ms.py; the txt file is generated automatically in the format path|speakerId|text.
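The naming convention matters because the speaker ID is read back out of the file name. A small illustration of how one such path|speakerId|text line can be formed (auto_ms.py itself may differ in detail; the clip name and transcript here are made up):

```python
from pathlib import Path

def filelist_line(clip_path, transcript):
    # The leading number in "speakerId_XXXX_<slice>.wav" is the speaker ID
    speaker_id = Path(clip_path).name.split("_")[0]
    return f"{clip_path}|{speaker_id}|{transcript}"

print(filelist_line("slice_audio/3_0007_0.wav", "ごめんね優衣"))
# -> slice_audio/3_0007_0.wav|3|ごめんね優衣
```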
Note: if you use auto_ms.py to generate the txt file, you must change the command in the Alignment and Text Conversion step to the following (for multi-speaker training, text_index is 2 rather than 1):
```sh
# Alignment and Text Conversion (multi-speaker: text_index is 2, not 1)
python preprocess.py --text_index 2 --text_cleaners japanese_cleaners --filelists /content/whisper-vits-japanese/filelists/train_filelist.txt /content/whisper-vits-japanese/filelists/val_filelist.txt

# Multi-speaker training
python train_ms.py -c configs/ms.json -m ms
```
After training, multi-speaker inference can be run like this (the imports and the get_text helper follow the VITS inference notebook):

```python
import torch
import IPython.display as ipd
import commons
import utils
from models import SynthesizerTrn
from text import text_to_sequence
from text.symbols import symbols

def get_text(text, hps):  # raw text -> tensor of symbol IDs via the configured cleaners
    text_norm = text_to_sequence(text, hps.data.text_cleaners)
    if hps.data.add_blank:
        text_norm = commons.intersperse(text_norm, 0)
    return torch.LongTensor(text_norm)

hps = utils.get_hparams_from_file("./configs/ms.json")
net_g = SynthesizerTrn(
    len(symbols),
    hps.data.filter_length // 2 + 1,
    hps.train.segment_size // hps.data.hop_length,
    n_speakers=hps.data.n_speakers,
    **hps.model).cuda()
_ = net_g.eval()
_ = utils.load_checkpoint("/root/autodl-tmp/logs/ms/G_29000.pth", net_g, None)
stn_tst = get_text("ごめんね優衣", hps)
with torch.no_grad():
    x_tst = stn_tst.cuda().unsqueeze(0)
    x_tst_lengths = torch.LongTensor([stn_tst.size(0)]).cuda()
    sid = torch.LongTensor([11]).cuda()  # 11 is the speakerId; with 12 n_speakers the IDs run from 0 to 11
    audio = net_g.infer(x_tst, x_tst_lengths, sid=sid, noise_scale=.667, noise_scale_w=0.8, length_scale=1)[0][0, 0].data.cpu().float().numpy()
ipd.display(ipd.Audio(audio, rate=hps.data.sampling_rate, normalize=False))
```