ElevenLabs, a pioneer in AI voice cloning and generation, recently released its latest speech-to-text model, Scribe v1. According to the company, the model delivers high accuracy across many languages, and users can try it on the official website.

According to ElevenLabs' own benchmarks, Scribe transcribes speech more accurately than Google's Gemini 2.0 Flash, OpenAI's Whisper v3, and Deepgram's Nova-3, with what the company describes as record-low error rates. The model supports high-accuracy transcription in 99 languages, including traditionally underserved ones such as Serbian, Cantonese, and Malayalam.
Flavio Schneider, chief researcher at ElevenLabs, said on the social platform X that Scribe is the smartest audio understanding model the company has released so far. He explained that Scribe is more than a transcription tool: it can also understand audio content, detect non-speech events (such as laughter, sound effects, music, and background noise), and analyze long-form audio recorded in complex environments while accurately telling speakers apart. Notably, Scribe can distinguish up to 32 different speakers in a single audio file.

ElevenLabs notes that Scribe is best suited to use cases that require high-accuracy transcription rather than real-time transcription. The company also plans to launch a low-latency version to extend the model to real-time applications.
On the FLEURS and Common Voice benchmarks, Scribe performs well on real-world audio, with reported accuracy of 98.7% in Italian and 96.7% in English, figures derived from word error rate.
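To make the benchmark figures concrete, the sketch below shows how word error rate (WER) is conventionally computed: the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. It assumes that the accuracy figures quoted above are simply 1 minus WER, which the article does not spell out. The function name and the sample sentences are illustrative only and are not part of any ElevenLabs tooling.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences (not benchmark data): one substitution, one deletion.
reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over lazy dog"

wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.1%}, accuracy (assumed 1 - WER): {1 - wer:.1%}")
```

Under that assumption, 98.7% accuracy corresponds to a WER of about 1.3%, or roughly 13 wrong words per 1,000 spoken.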
Scribe is now available through the ElevenLabs website and API, priced at $0.40 per hour of input audio, with a 50% discount for the next six weeks.
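For a rough sense of what that pricing means in practice, the following sketch estimates transcription cost from the two figures quoted above, $0.40 per hour of audio and a 50% launch discount. The function and constant names are hypothetical, and the calculation ignores any plan tiers, minimums, or taxes that may apply.

```python
LIST_PRICE_PER_HOUR = 0.40  # USD per hour of input audio, as quoted in the article
LAUNCH_DISCOUNT = 0.50      # 50% promotional discount, as quoted in the article


def transcription_cost(audio_hours: float, discounted: bool = False) -> float:
    """Estimate the cost in USD of transcribing `audio_hours` of audio."""
    if audio_hours < 0:
        raise ValueError("audio_hours must be non-negative")
    rate = LIST_PRICE_PER_HOUR
    if discounted:
        rate *= 1 - LAUNCH_DISCOUNT
    return round(audio_hours * rate, 2)


# Example: a 100-hour archive of meeting recordings.
print(transcription_cost(100))                   # 40.0
print(transcription_cost(100, discounted=True))  # 20.0
```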
For enterprise decision makers, Scribe offers a scalable, high-accuracy transcription tool for industries that rely on automated documentation, meeting transcription, and content accessibility. Its multilingual accuracy should also benefit multinational corporations, media companies, and customer support applications.
Notably, Scribe was released on the same day that competitor Hume released its text-to-speech model, Octave. Octave is a text-to-speech tool built on large language models that lets users customize AI-generated voices to convey specific emotions. It is designed for content creation such as audiobooks, podcasts, and video game voice-overs. Although Scribe and Octave serve different purposes, the two releases reflect increasingly fierce competition in AI-driven audio models.
Product portal: https://elevenlabs.io/blog/meet-scribe
Key points:
Scribe v1 is ElevenLabs' latest speech-to-text model, which the company says sets accuracy records across multiple languages.
It supports 99 languages, can distinguish up to 32 different speakers, and handles complex audio environments.
It is currently priced at $0.40 per hour of audio, with a 50% discount for the next six weeks; a low-latency version is under development.