Bilibili's text-to-speech model IndexTTS: pinyin-based correction of Chinese character pronunciation and precise pause control

Author: Eve Cole  Update time: 2025-05-25 15:25:01

Bilibili recently released a text-to-speech (TTS) model called IndexTTS. The model builds on XTTS and Tortoise and uses a GPT-style architecture. For Chinese text, it lets users correct the pronunciation of a character by supplying its pinyin, and it controls pauses at any position through punctuation marks. Together, these features make the synthesized speech sound more natural and fluent, and the release has drawn wide attention.
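To make the idea of pinyin correction and punctuation-driven pauses more concrete, here is a minimal sketch of how a front end might split mixed input into character, pinyin and pause tokens. The brace notation (for example `{chong2}`), the `tokenize` function and the token names are all hypothetical, chosen only for illustration. They are not IndexTTS's actual input format or API.

```python
import re

# Hypothetical inline notation: a pinyin syllable with a tone digit in braces,
# e.g. "{chong2}", stands in for the character whose reading should be forced.
# This notation is an assumption for illustration, not IndexTTS's input format.
PINYIN = re.compile(r"\{([a-z]+[1-5])\}")

# Punctuation marks that this sketch treats as pause positions.
PAUSE_MARKS = set("，。、；：？！,.;:?!")


def tokenize(text):
    """Split text into ('char', c), ('pinyin', p) and ('pause', mark) tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = PINYIN.match(text, pos)
        if match:
            # An explicit pinyin syllable overrides the model's own guess.
            tokens.append(("pinyin", match.group(1)))
            pos = match.end()
        elif text[pos] in PAUSE_MARKS:
            # Punctuation becomes an explicit pause token at this position.
            tokens.append(("pause", text[pos]))
            pos += 1
        elif text[pos].isspace():
            pos += 1
        else:
            tokens.append(("char", text[pos]))
            pos += 1
    return tokens


if __name__ == "__main__":
    for token in tokenize("请{chong2}新开始，好吗？"):
        print(token)
```

In this sketch the character 重, which can be read "zhong4" or "chong2", is replaced by the intended reading, and each punctuation mark becomes a separate token that a synthesizer could map to a pause.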

Trained on tens of thousands of hours of data, IndexTTS is reported to outperform popular TTS systems such as XTTS, CosyVoice2, Fish-Speech and F5-TTS. Several modules of the system have been optimized, with notable improvements in the representation of speaker condition features and in audio quality. Because the model uses hybrid modeling of characters and pinyin, a misread Chinese character can be corrected quickly by supplying its pinyin.

The model uses a new conditional encoder and a BigVGAN2-based speech decoder, which improve training stability as well as speaker similarity and sound quality. The R&D team says it has submitted a paper to arXiv and plans to release the model parameters and code in the next few weeks. IndexTTS also provides several test sets, including polysyllabic words and subjective and objective evaluation sets, for researchers to analyze in depth.

IndexTTS performed well in multiple evaluations, outperforming many comparable models on word error rate (WER) and speaker similarity (SS). In Mandarin tests, for example, its WER was only 1.3%, well below that of the other models, which indicates high accuracy and stability. In the sound quality evaluation, its mean opinion score (MOS) reached 4.01.
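For readers unfamiliar with the metric, WER is the edit distance between a reference transcript and the transcript recognized from the synthesized audio, divided by the length of the reference. The sketch below shows that calculation under stated assumptions: the function names are illustrative, the example sentences are made up, and scoring Mandarin per character rather than per word is a common convention assumed here, not a detail confirmed by the article.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance divided by the number of reference tokens."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # Made-up example: one wrong character out of six reference characters.
    reference = list("今天天气很好")
    hypothesis = list("今天天汽很好")
    print(f"{error_rate(reference, hypothesis):.3f}")
```

For English text, the same functions work on word lists, for example `error_rate("the cat sat".split(), "the cat sit".split())`.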

The release of IndexTTS marks a further step forward for text-to-speech technology. For more information about the system, or for technical support, users can contact the team.

Project address: https://github.com/index-tts/index-tts