The name SummerTTS commemorates the coming and passing of the summer of 2023
Overview
- SummerTTS is a standalone speech synthesis (TTS) program. It runs locally with no network connection and no extra dependencies, and can be compiled in one step for both Chinese and English speech synthesis.
- SummerTTS uses Eigen as its underlying computation library. Eigen is a header-only collection of template functions, so in most cases including its headers is enough; as a result this project has no other dependencies and can be compiled and run on its own in a C++ environment.
- This project uses Eigen's matrix library to implement the neural network operators, so it does not depend on other NN runtimes such as PyTorch, TensorFlow, or ncnn (see the sketch after this list).
- This project is compiled and run on Ubuntu. Other Linux-like platforms such as Android and Raspberry Pi should not pose any major problems. It has not been tested on Windows and may require minor changes there.
- The models in this project are based on the VITS speech synthesis algorithm, on top of which the C++ engineering work was done.
- This project is released under the MIT License. Developers, users, and organizations building on this project should follow the MIT License: https://mit-license.org
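To illustrate the header-only approach described above, here is a minimal sketch (not code from this project) of a fully connected layer written with nothing but Eigen headers; the project's actual operator code may look different:

#include <Eigen/Dense>
#include <iostream>

// A fully connected layer (y = relu(W * x + b)) built only from Eigen's
// header-only matrix types; no separate NN runtime needs to be linked.
Eigen::VectorXf dense(const Eigen::MatrixXf & W,
                      const Eigen::VectorXf & b,
                      const Eigen::VectorXf & x)
{
    return (W * x + b).cwiseMax(0.0f); // matrix multiply, bias, ReLU
}

int main()
{
    Eigen::MatrixXf W = Eigen::MatrixXf::Random(4, 3);
    Eigen::VectorXf b = Eigen::VectorXf::Zero(4);
    Eigen::VectorXf x = Eigen::VectorXf::Ones(3);
    std::cout << dense(W, b, x) << std::endl; // prints the 4-dim output
    return 0;
}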
Update log
- 2024-12-14: Added license information: MIT License: https://mit-license.org
- 2023-06-16: Added a faster English speech synthesis model, single_speaker_english_fast.bin, available from the network disk below. It is faster, and the synthesized sound quality is not significantly reduced:
Link: https://pan.baidu.com/s/1rYhtznOYQH7m8g-xZ_2VVQ?pwd=2d5h Extraction code: 2d5h
- 2023-06-15: Added support for pure English speech synthesis. Pull the latest code, use the model file single_speaker_english.bin from the network disk below, and synthesize English speech as follows:
./tts_test ../test_eng.txt ../models/single_speaker_english.bin out_eng.wav
The network disk link is below. The previous Chinese synthesis and its usage are unaffected. Note that this update supports pure English synthesis only; mixed Chinese and English text is not supported for now.
Link: https://pan.baidu.com/s/1rYhtznOYQH7m8g-xZ_2VVQ?pwd=2d5h Extraction code: 2d5h
- 2023-06-09: Added a medium-sized single-speaker model, single_speaker_mid.bin. It is slightly slower than the previous model, but the synthesized sound quality seems better (my ears are not very sensitive; it sounds better to me, though that may be a placebo effect :P). No code update is needed; just download single_speaker_mid.bin from the existing network disk and use it.
- 2023-06-08: Modified test/main.cpp to support synthesizing text with newlines and whole passages of text
- 2023-06-03: Fixed an error in yesterday's version. Thanks to the enthusiastic netizen Telen for providing tests and clues. Only a code update is required; the model does not need to be updated.
- 2023-06-02: Greatly improved the accuracy of polyphonic character pronunciation. A new model from the Baidu Netdisk is required to use the improved polyphone handling and text normalization (TN). Today's updated code cannot use the previous models, otherwise it may crash.
- 2023-05-30: Integrated WeTextProcessing as the front-end text normalization module, greatly improving the pronunciation of numbers, currencies, temperatures, dates, etc. You need to download a new model from the Baidu Netdisk below.
- 2023-05-23: A new algorithm greatly improved the speech synthesis speed of the single-speaker models.
- 2023-04-21: Initial version
Instructions for use
Clone this project locally, preferably in an Ubuntu Linux environment.
Download the models from the following Baidu network disk address and put them in the models directory of this project: Link: https://pan.baidu.com/s/1rYhtznOYQH7m8g-xZ_2VVQ?pwd=2d5h Extraction code: 2d5h
After the model files are in place, the models directory structure is as follows:
models/
├── multi_speakers.bin
├── single_speaker_mid.bin
├── single_speaker_english.bin
├── single_speaker_english_fast.bin
└── single_speaker_fast.bin
Enter the Build directory and execute the following commands:
cmake ..
make
After compilation completes, the tts_test executable will be generated in the Build directory.
Run the following command to test Chinese speech synthesis (TTS):
./tts_test ../test.txt ../models/single_speaker_fast.bin out.wav
Run the following command to test English speech synthesis (TTS):
./tts_test ../test_eng.txt ../models/single_speaker_english.bin out_eng.wav
In this command line:
The first parameter is the path to a text file containing the text to be synthesized.
The second parameter is the path to one of the models described above. The single or multi prefix in the file name indicates whether the model contains a single speaker or multiple speakers. The recommended single-speaker model is single_speaker_fast.bin: its synthesis is faster and the sound quality is acceptable.
The third parameter is the synthesized audio file, which can be opened with a player after the program has run.
The test program above is implemented in test/main.cpp. The synthesis interface it calls is defined in include/SynthesizerTrn.h as follows:
int16_t * infer(const string & line, int32_t sid, float lengthScale, int32_t & dataLen)
In this interface:
The first parameter is the text string to be synthesized.
The second parameter specifies the id of the speaker whose voice is used for synthesis. This parameter is meaningful for multi-speaker models and should be fixed at 0 for single-speaker models. The number of speakers can be obtained through the interface int32_t getSpeakerNum(), and the valid ids range from 0 to that number minus 1.
The third parameter, lengthScale, controls the speed of the synthesized speech; the larger its value, the slower the speech.
The last parameter, dataLen, is an output parameter through which the length of the synthesized audio data is returned; the function's return value points to the synthesized 16-bit audio samples.
The text to be synthesized can contain Arabic numerals and punctuation, but because this project's text normalization (TN) module is still rough, English characters in the input are ignored. For the same reason, polyphonic characters are sometimes pronounced incorrectly in certain contexts.
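For reference, here is a minimal sketch of calling this interface from C++. The model-loading step and the SynthesizerTrn constructor shown here are assumptions made for illustration; see test/main.cpp for the exact loading and WAV-writing code this project actually uses:

#include <cstdint>
#include <fstream>
#include <vector>
#include "SynthesizerTrn.h"

int main()
{
    // Assumed loading step: read the whole model file into a float buffer.
    std::ifstream modelFile("../models/single_speaker_fast.bin",
                            std::ios::binary | std::ios::ate);
    std::streamsize bytes = modelFile.tellg();
    modelFile.seekg(0, std::ios::beg);
    std::vector<float> modelData(bytes / sizeof(float));
    modelFile.read((char *)modelData.data(), bytes);

    // Assumed constructor taking the raw model buffer and its size; check
    // include/SynthesizerTrn.h and test/main.cpp for the real signatures.
    SynthesizerTrn synthesizer(modelData.data(), (int32_t)modelData.size());

    int32_t dataLen = 0;
    // sid is fixed to 0 for a single-speaker model; lengthScale = 1.0f is
    // normal speed, larger values slow the speech down.
    int16_t * pcm = synthesizer.infer("今天天气真好", 0, 1.0f, dataLen);

    // Dump the raw 16-bit audio; test/main.cpp instead wraps it in a WAV
    // header so a player can open it. dataLen is assumed to count samples.
    std::ofstream out("out.pcm", std::ios::binary);
    out.write((const char *)pcm, dataLen * sizeof(int16_t));
    return 0;
}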
Follow-up development
- Model training and conversion scripts will be open-sourced later
- In the future, we will try to train and provide models with better sound quality
Contact the author
- If you have any questions or needs, you can email [email protected] or add WeChat: hwang_2011. I will try my best to reply.
License
- This project is released under the MIT License. Developers, users, and organizations building on this project should follow the MIT License: https://mit-license.org
Acknowledgements
This project draws on the following projects for source code and algorithms; many thanks to them. If any legal issues arise, please contact me promptly so that they can be coordinated and resolved.
- Eigen
- vits (https://github.com/jaywalnut310/vits)
- vits_chinese (https://github.com/UEhQZXI/vits_chinese)
- MB-iSTFT-VITS (https://github.com/MasayaKawamura/MB-iSTFT-VITS)
- WeTextProcessing (https://github.com/wenet-e2e/WeTextProcessing)
- glog (https://github.com/google/glog)
- gflags (https://github.com/gflags/gflags)
- openfst (https://github.com/kkm000/openfst)
- Chinese characters to pinyin (https://github.com/yangyangwithgnu/hanz2piny)
- cppjieba (https://github.com/yanyiwu/cppjieba)
- g2p_en (https://github.com/Kyubyong/g2p)
- English-to-IPA (https://github.com/mphilli/English-to-IPA)
- The Chinese single-speaker models in this project were trained on the open-source Biaobei dataset, the multi-speaker model on the open-source AISHELL-3 dataset, and the English single-speaker models on the LJ Speech dataset.