On March 13, Sesame officially released its latest speech synthesis model, CSM, which quickly drew widespread attention across the industry. According to the official introduction, CSM uses an end-to-end multimodal Transformer architecture that deeply understands conversational context and generates natural, expressive speech. The resulting audio is strikingly realistic, nearly indistinguishable from a real human voice.
CSM not only supports real-time speech generation but also accepts both text and audio as input. Users can adjust parameters to control characteristics such as tone, pitch, rhythm, and emotion, giving the model considerable flexibility. This capacity for personalized voice generation lets CSM perform well across a wide range of application scenarios.
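For developers, the most concrete way to try this is through the open-sourced CSM-1B checkpoint (introduced in the next paragraph). Below is a minimal sketch of single-utterance generation based on the usage pattern published in the SesameAILabs/csm repository; names such as load_csm_1b come from that repo and may change, so treat this as illustrative rather than definitive.

```python
# Minimal single-utterance generation with the open CSM-1B checkpoint.
# Pattern follows the SesameAILabs/csm repository; exact helper names
# (e.g. load_csm_1b) are that repo's and may have changed since release.
import torch
import torchaudio
from generator import load_csm_1b  # helper module shipped with the csm repo

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)

# Synthesize one line of text as speaker 0, capped at roughly 10 seconds.
audio = generator.generate(
    text="Hello from Sesame.",
    speaker=0,
    context=[],              # no prior dialogue context for a single utterance
    max_audio_length_ms=10_000,
)

# The generator reports its own sample rate; write the result as a mono WAV.
torchaudio.save("hello.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```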
CSM is widely regarded as a major breakthrough in AI speech technology. Its pronunciation is so natural that listeners often cannot tell whether they are hearing synthetic or human speech. One user posted a video showing that CSM responds with almost no latency, calling it "the strongest model I have ever experienced." Previously, Sesame open-sourced a smaller version, CSM-1B, which supports multi-turn dialogue with coherent generated speech and has been widely praised.
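The multi-turn coherence comes from conditioning generation on earlier dialogue turns, each supplied as paired text and audio. The sketch below follows the Segment-based pattern from the SesameAILabs/csm repository; the file paths are hypothetical and the exact class and function names should be checked against the current repo.

```python
# Multi-turn generation: condition CSM-1B on earlier dialogue turns so the
# new utterance stays coherent in voice and prosody across the conversation.
# Sketch based on the Segment pattern in the SesameAILabs/csm repository;
# treat names like Segment and load_csm_1b as illustrative.
import torch
import torchaudio
from generator import load_csm_1b, Segment

device = "cuda" if torch.cuda.is_available() else "cpu"
generator = load_csm_1b(device=device)

def load_audio(path: str) -> torch.Tensor:
    # Load a prior turn's audio and resample it to the generator's rate.
    audio, sample_rate = torchaudio.load(path)
    return torchaudio.functional.resample(
        audio.squeeze(0), orig_freq=sample_rate, new_freq=generator.sample_rate
    )

# Two earlier turns of a conversation (hypothetical local recordings).
context = [
    Segment(text="Hey, how was your day?", speaker=0, audio=load_audio("turn_0.wav")),
    Segment(text="Pretty good, just busy.", speaker=1, audio=load_audio("turn_1.wav")),
]

# Generate the next turn for speaker 0, conditioned on the dialogue so far.
audio = generator.generate(
    text="Busy how? Anything interesting?",
    speaker=0,
    context=context,
    max_audio_length_ms=10_000,
)
torchaudio.save("turn_2.wav", audio.unsqueeze(0).cpu(), generator.sample_rate)
```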
At present, CSM is trained primarily on English, where it performs very well, but its multilingual support remains limited. The model does not yet support Chinese, though Sesame says it plans to expand language coverage in the future to meet the needs of more users.
Sesame has also said it will open-source its research, a decision that has sparked lively discussion among developers on GitHub. Beyond conversational AI, CSM could drive innovation in voice interaction for fields such as education and entertainment. Industry observers generally believe CSM may reset the bar for AI voice assistants and deliver a more natural human-computer dialogue experience.