Google's major upgrade of AI voice technology: 2 minutes of dialogue generated in 3 seconds, set to change human-computer interaction

Author: Eve Cole. Updated: 2025-02-15 12:48:02

Google's latest voice generation technology has raised the bar for the industry. It can generate up to 2 minutes of natural dialogue in 3 seconds while keeping each speaker's voice consistent and the audio quality high across multiple speakers. The technology is already used in Google products such as Gemini Live and Project Astra, and it is changing how people around the world interact with digital assistants and AI tools.
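To put the headline figure in perspective, the short calculation below converts "2 minutes in 3 seconds" into a real-time speedup. It is only an illustrative sketch: the function name `real_time_factor` is made up for this example, and the inputs are the round numbers quoted in this article, not measured benchmarks.

```python
def real_time_factor(audio_seconds: float, generation_seconds: float) -> float:
    """Return how many seconds of audio are produced per second of generation time."""
    if generation_seconds <= 0:
        raise ValueError("generation_seconds must be positive")
    return audio_seconds / generation_seconds


# Figures quoted in the article: 2 minutes of dialogue generated in 3 seconds.
speedup = real_time_factor(audio_seconds=2 * 60, generation_seconds=3)
print(f"{speedup:.0f}x faster than real time")
```

With these inputs the audio is produced roughly 40 times faster than it takes to play back.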

Over the past few years, Google has invested steadily in audio generation research. Its models can create high-quality, natural-sounding speech from a variety of inputs, such as text, tempo controls, and specific voices. Recently, Google's researchers worked with several internal teams to launch two notable features: NotebookLM Audio Overviews, which turn uploaded documents into lively conversations, and Illuminate, which generates formal AI discussions about research papers, making specialist knowledge easier to understand and digest.

These features build on several earlier Google research results. From the SoundStream neural audio codec, to the AudioLM audio language modeling framework, to SoundStorm, which can generate 30-second segments of multi-speaker dialogue, Google has steadily advanced the state of voice generation. The latest work uses a more efficient speech codec that compresses audio at bit rates as low as 600 bits per second without compromising output quality.
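The 600 bits per second figure is easier to appreciate next to uncompressed audio. The sketch below compares the two for a 2-minute clip. The codec bit rate and clip length come from this article; the uncompressed baseline (24 kHz, 16-bit, mono PCM) is an assumption chosen purely for illustration, and `encoded_size_bytes` is a hypothetical helper name.

```python
def encoded_size_bytes(bitrate_bps: float, duration_s: float) -> float:
    """Size in bytes of an audio stream at a constant bit rate."""
    return bitrate_bps * duration_s / 8


CODEC_BPS = 600          # bit rate quoted in the article
PCM_BPS = 24_000 * 16    # ASSUMPTION: 24 kHz, 16-bit mono PCM as the uncompressed baseline
DURATION_S = 2 * 60      # the 2-minute dialogue length quoted in the article

codec_bytes = encoded_size_bytes(CODEC_BPS, DURATION_S)
pcm_bytes = encoded_size_bytes(PCM_BPS, DURATION_S)

print(f"codec: {codec_bytes:,.0f} bytes")
print(f"PCM:   {pcm_bytes:,.0f} bytes")
print(f"ratio: {pcm_bytes / codec_bytes:.0f}x smaller")
```

Under these assumptions, 2 minutes of dialogue takes about 9 KB of codec data versus roughly 5.8 MB of raw PCM, a reduction of about 640 times. A different baseline sample rate or bit depth would change the ratio accordingly.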

To achieve this, Google developed a specialized Transformer architecture that can efficiently process hierarchies of information. The model is first pre-trained on hundreds of thousands of hours of speech data and then fine-tuned on a high-quality conversation dataset that captures natural features of real conversations, such as tone and pauses. To support responsible use of the technology, Google has also integrated SynthID, which adds a watermark to AI-generated audio.

Looking ahead, Google is working to improve the model's fluency and sound quality and to add more fine-grained controls. Combined with the Gemini family of models, the technology is expected to play an important role in improving educational experiences and content accessibility, and in broadening what voice technology can do.

The technology matters not only for its performance gains but also because it opens a new chapter in human-computer interaction. By turning complex technical advances into natural, intuitive ways of interacting, Google is laying the foundation for the next generation of digital experiences.

Details: https://deepmind.google/discover/blog/pushing-the-frontiers-of-audio-generation/

Google's voice generation technology is both a technical leap and a significant step forward for human-computer interaction, opening up new possibilities for future digital products.