This repository contains a Dockerfile that extends the PyTorch 21.02-py3 NGC container and encapsulates some dependencies. To create your own container, choose a PyTorch container from NVIDIA PyTorch Container Versions and create a Dockerfile in the following format:
FROM nvcr.io/nvidia/pytorch:21.02-py3
WORKDIR /path/to/working/directory/text2speech/
COPY requirements.txt .
RUN pip install -r requirements.txt
Go to the /path/to/working/directory/text2speech/docker directory.
$ docker build --no-cache -t torcht2s .
$ docker run -it --rm --gpus all -p 2222:8888 -v /path/to/working/directory/text2speech:/path/to/working/directory/text2speech torcht2s
$ python -m ipykernel install --user --name=torcht2s
$ jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
Open http://127.0.0.1:2222/?token=${TOKEN} and enter the token shown in your terminal.
To train speech synthesis models, sounds and the phoneme sequences that express them are needed. That is why, in the first step, the input text is encoded into a list of symbols. In this study, we use Turkish characters and phonemes as the symbols. Since Turkish is a phonetic language, words are written as they are read; that is, words in Turkish are built directly from character sequences. In non-phonetic languages such as English, words are instead expressed with phonemes. To synthesize Turkish speech with English data, the words in the English dataset must first be phonetically translated into Turkish.
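The encoding step described above can be illustrated with a minimal sketch. It is not the repository's text-processing code: the SYMBOLS list is a small hypothetical subset chosen for illustration, and text_to_sequence is a hypothetical helper. It splits a phoneme string greedily into the longest matching symbols, so that multi-character symbols such as 'tS' or 'a:' are matched before their single-character prefixes, and maps each symbol to its index in the list.

```python
# Hypothetical subset of phoneme symbols, for illustration only.
SYMBOLS = ['a', 'a:', 'd', 'dZ', 'j', 't', 'tS', 'S', 'Z']


def text_to_sequence(text, symbols):
    """Greedily split `text` into the longest matching symbols and return their IDs."""
    symbol_to_id = {s: i for i, s in enumerate(symbols)}
    max_len = max(len(s) for s in symbols)
    ids = []
    pos = 0
    while pos < len(text):
        # Try the widest candidate first so 'tS' wins over 't'.
        for width in range(max_len, 0, -1):
            chunk = text[pos:pos + width]
            if chunk in symbol_to_id:
                ids.append(symbol_to_id[chunk])
                pos += len(chunk)
                break
        else:
            raise ValueError(f"no symbol matches at position {pos}: {text[pos:]!r}")
    return ids


print(text_to_sequence("tSa:j", SYMBOLS))
```

The ID of each symbol is simply its position in the list, so the order of the symbol list must stay fixed between training and inference.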
valid_symbols = ['1', '1:', '2', '2:', '5', 'a', 'a:', 'b', 'c', 'd', 'dZ', 'e', 'e:', 'f', 'g', 'gj', 'h', 'i', 'i:', 'j',
'k', 'l', 'm', 'n', 'N', 'o', 'o:', 'p', 'r', 's', 'S', 't', 'tS', 'u', 'u', 'v', 'y', 'y:', 'z', 'Z']
To speed up training, mel-spectrograms and pitch can be generated during the pre-processing step and read directly from disk during training. Follow these steps to use a custom dataset.
text2speech/Fastpitch/dataset/ location.
Those filelists should list a single utterance per line as:
<audio file path>|<transcript>
text2speech/Fastpitch/data_preperation.ipynb
$ python prepare_dataset.py \
    --wav-text-filelists dataset/tts_data.txt \
    --n-workers 16 \
    --batch-size 1 \
    --dataset-path dataset \
    --extract-pitch \
    --f0-method pyin \
    --extract-mels
create_picth_text_file(manifest_path) from text2speech/Fastpitch/data_preperation.ipynb
Those filelists should list a single utterance per line as:
<mel or wav file path>|<pitch file path>|<text>|<speaker_id>
The complete dataset has the following structure:
./dataset
├── mels
├── pitch
├── wavs
├── tts_data.txt # train + val
├── tts_data_train.txt
├── tts_data_val.txt
├── tts_pitch_data.txt # train + val
├── tts_pitch_data_train.txt
└── tts_pitch_data_val.txt
The training will produce a FastPitch model capable of generating mel-spectrograms from raw text. It will be serialized as a single .pt checkpoint file, along with a series of intermediate checkpoints.
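Before starting a long training run, it can help to check that the pitch filelists follow the four-field format described above. The following is a minimal sketch, not part of the repository: check_pitch_filelist is a hypothetical helper, and it assumes that the paths in the filelist are relative to the dataset directory and that the text field itself contains no '|' character.

```python
from pathlib import Path


def check_pitch_filelist(filelist_path, dataset_root="."):
    """Return a list of (line_number, problem) tuples for a pitch filelist.

    Each non-empty line is expected to look like:
    <mel or wav file path>|<pitch file path>|<text>|<speaker_id>
    """
    problems = []
    root = Path(dataset_root)
    with open(filelist_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            line = line.rstrip("\n")
            if not line.strip():
                continue  # ignore blank lines
            fields = line.split("|")
            if len(fields) != 4:
                problems.append((lineno, f"expected 4 fields, found {len(fields)}"))
                continue
            audio_or_mel, pitch, text, _speaker_id = fields
            if not text.strip():
                problems.append((lineno, "empty text"))
            # Assumption: paths are relative to the dataset root.
            for rel in (audio_or_mel, pitch):
                if not (root / rel).is_file():
                    problems.append((lineno, f"missing file: {rel}"))
    return problems
```

For example, check_pitch_filelist("dataset/tts_pitch_data_train.txt", "dataset") returns an empty list when every line is well formed and every referenced file exists.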
$ python train.py --cuda --amp --p-arpabet 1.0 --dataset-path dataset \
    --output saved_fastpicth_models/ \
    --training-files dataset/tts_pitch_data_train.txt \
    --validation-files dataset/tts_pitch_data_val.txt \
    --epochs 1000 --learning-rate 0.001 --batch-size 32 \
    --load-pitch-from-disk
The last step is converting the spectrogram into a waveform. A model that generates speech from a spectrogram is called a vocoder.
Some mel-spectrogram generators are prone to model bias. As their spectrograms differ from the true data on which HiFi-GAN was trained, the quality of the generated audio might suffer. To overcome this problem, a HiFi-GAN model can be fine-tuned on the outputs of a particular mel-spectrogram generator so that it adapts to this bias. In this section, we fine-tune HiFi-GAN on FastPitch outputs.
text2speech/Hifigan/data/pretrained_fastpicth_model/ directory.
tts_pitch_data.txt in the text2speech/Hifigan/data/ directory.
$ python extract_mels.py --cuda \
    -o data/mels-fastpitch-tr22khz \
    --dataset-path /text2speech/Fastpitch/dataset \
    --dataset-files data/tts_pitch_data.txt \
    --load-pitch-from-disk \
    --checkpoint-path data/pretrained_fastpicth_model/FastPitch_checkpoint.pt -bs 16
Here, data/tts_pitch_data.txt is the combined train + val filelist.
Mel-spectrograms should now be prepared in the text2speech/Hifigan/data/mels-fastpitch-tr22khz directory.
The fine-tuning script will load an existing HiFi-GAN model and run several epochs of training using spectrograms generated in the last step.
This step will produce another .pt HiFi-GAN model checkpoint file fine-tuned to the particular FastPitch model.
results in the text2speech/Hifigan directory.
$ nohup python train.py --cuda --output /results/hifigan_tr22khz \
    --epochs 1000 --dataset_path /Fastpitch/dataset \
    --input_mels_dir /data/mels-fastpitch-tr22khz \
    --training_files /Fastpitch/dataset/tts_data.txt \
    --validation_files /Fastpitch/dataset/tts_data.txt \
    --fine_tuning --fine_tune_lr_factor 3 --batch_size 16 \
    --learning_rate 0.0003 --lr_decay 0.9998 --validation_interval 10 > log.txt
$ tail -f log.txt
Run the following command to synthesize audio from raw text with the mel-spectrogram generator:
$ python inference.py --cuda \
    --hifigan /Hifigan/results/hifigan_tr22khz/hifigan_gen_checkpoint.pt \
    --fastpitch /Fastpitch/saved_fastpicth_models/FastPitch_checkpoint.pt \
    -i test_text.txt \
    -o wavs/
The speech is generated from a file passed with the -i argument.
The output audio will be stored in the path specified by the -o argument.
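To sanity-check the synthesized audio, the files in the output directory can be inspected with the Python standard library. This is a minimal sketch, not part of the repository: summarize_wavs is a hypothetical helper, and it assumes the outputs are standard PCM .wav files, which is the only format the wave module reads.

```python
import wave
from pathlib import Path


def summarize_wavs(directory):
    """Return (file name, sample rate in Hz, duration in seconds) for each .wav file."""
    summary = []
    for path in sorted(Path(directory).glob("*.wav")):
        # Assumption: files are PCM-encoded; wave.open raises wave.Error otherwise.
        with wave.open(str(path), "rb") as wav:
            rate = wav.getframerate()
            duration = wav.getnframes() / rate
        summary.append((path.name, rate, duration))
    return summary
```

Calling summarize_wavs("wavs/") after inference lists each generated file with its sample rate and length, which makes empty or unexpectedly short outputs easy to spot.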