At a recent launch event, Google CEO Sundar Pichai announced a major release: Google has open-sourced its latest multimodal model, Gemma-3. With its low cost and high performance, the model quickly became a focus of the technology industry. The release of Gemma-3 marks another important step forward for Google in artificial intelligence, particularly in multimodal and long-context processing.
Gemma-3 is offered at four parameter scales: 1 billion, 4 billion, 12 billion, and 27 billion parameters. Notably, the 27-billion-parameter model needs only a single H100 GPU for efficient inference, roughly one-tenth the computing power required by comparable models. This breakthrough makes Gemma-3 one of the high-performance models with the lowest compute requirements, greatly lowering the barrier to use.
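For readers who want to try this, here is a minimal sketch of single-GPU inference using Hugging Face transformers. The model id and the Gemma3ForConditionalGeneration class follow the release documentation; a recent transformers version and acceptance of the Gemma license on the model page are assumed.

```python
import torch
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-27b-it"  # id from the Gemma-3 release collection

# In bfloat16, the 27B weights come to roughly 54 GB, which is why a single
# 80 GB H100 is enough for inference.
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {"role": "user", "content": [{"type": "text", "text": "Why is the sky blue?"}]}
]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(out[0], skip_special_tokens=True))
```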
According to the published evaluations, Gemma-3 performs strongly on conversational benchmarks, second only to DeepSeek's model among those compared and ahead of several popular models such as OpenAI's o3-mini and Meta's Llama 3. Architecturally, Gemma-3 continues the decoder-only Transformer design of the previous two generations, but introduces several innovations and optimizations on that foundation. To address the memory cost of long contexts, Gemma-3 interleaves local and global self-attention layers, which significantly reduces memory usage.
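The interleaving idea is easy to illustrate. The toy sketch below assumes the 5:1 local-to-global layer ratio and 1024-token sliding window reported for Gemma-3; the mask construction itself is generic NumPy, not Google's implementation.

```python
import numpy as np

LOCAL_WINDOW = 1024                    # assumed sliding-window size
PATTERN = ["local"] * 5 + ["global"]   # assumed 5:1 ratio, repeated through the stack

def attention_mask(seq_len: int, kind: str) -> np.ndarray:
    """Boolean causal mask: True where query position i may attend to key j."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    causal = j <= i
    if kind == "global":
        return causal                        # full causal attention
    return causal & (i - j < LOCAL_WINDOW)   # sliding-window attention

# KV-cache intuition: local layers only need the last LOCAL_WINDOW keys/values,
# so with a 5:1 pattern most layers' cache stops growing with context length;
# only the global layers scale with the full 128K window.
for depth, kind in enumerate(PATTERN):
    mask = attention_mask(4096, kind)
    print(f"layer {depth} ({kind}): each query sees at most {mask.sum(axis=1).max()} keys")
```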
In terms of context handling, Gemma-3 extends the supported context length to 128K tokens, providing much better support for long documents. Gemma-3 is also multimodal: it can process text and images together, integrating a Vision Transformer-based vision encoder that keeps the computational cost of image processing down. These features allow Gemma-3 to perform well on complex tasks.
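As a sketch of the multimodal interface, the snippet below uses the transformers image-text-to-text pipeline with the lighter 4B checkpoint to pass an image and a question in one prompt. The image URL is a placeholder, and the message format follows the Hugging Face chat convention rather than anything Gemma-specific.

```python
import torch
from transformers import pipeline

# "image-text-to-text" is the transformers pipeline task for multimodal chat
# models; the 4B instruction-tuned checkpoint keeps the example light.
pipe = pipeline(
    "image-text-to-text",
    model="google/gemma-3-4b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # placeholder URL
        {"type": "text", "text": "Describe what is in this picture."},
    ]}
]
result = pipe(text=messages, max_new_tokens=128)
print(result[0]["generated_text"][-1]["content"])
```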
During training, Gemma-3 used a larger token budget than its predecessors; the 27-billion-parameter model in particular was trained on 14T tokens, and multilingual data was added to strengthen its language coverage. Gemma-3 supports 140 languages, 35 of which work out of the box. The model was trained with advanced knowledge distillation and then refined with reinforcement learning in post-training, with particular gains in helpfulness, reasoning ability, and multilingual ability.
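Knowledge distillation here means training the smaller student model to match a larger teacher's soft next-token distribution. The sketch below is the generic soft-target loss, not Google's training code; the temperature and toy tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """KL divergence from the teacher's softened next-token distribution to
    the student's, averaged over the batch (the standard soft-target loss)."""
    t = temperature
    teacher_logprobs = F.log_softmax(teacher_logits / t, dim=-1)
    student_logprobs = F.log_softmax(student_logits / t, dim=-1)
    # kl_div takes the student's log-probs as input and, with log_target=True,
    # the teacher's log-probs as target; t**2 rescales gradients as usual.
    return F.kl_div(student_logprobs, teacher_logprobs,
                    log_target=True, reduction="batchmean") * t ** 2

# Toy shapes: (batch, positions, vocab). Gemma's real vocabulary is far larger.
student = torch.randn(2, 8, 1000, requires_grad=True)
teacher = torch.randn(2, 8, 1000)
loss = distillation_loss(student, teacher, temperature=2.0)
loss.backward()
print(float(loss))
```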
In evaluations, Gemma-3 performed well on multimodal tasks, and its long-context processing was impressive, reaching 66% accuracy on long-text benchmarks. Its dialogue scores also place it near the top, demonstrating well-rounded strength across tasks. These results make Gemma-3 one of the most closely watched multimodal models.
Gemma-3 is open-sourced at: https://huggingface.co/collections/google/gemma-3-release-67c6c6f89c4f76621268bb6d. This open release will further advance artificial intelligence technology and give researchers and developers a powerful set of tools and resources.
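Fetching the weights locally is straightforward with the huggingface_hub client. The sketch below uses the 4B instruction-tuned checkpoint from the collection above as an example, and assumes the Gemma license has been accepted on the model page.

```python
from huggingface_hub import snapshot_download

# Downloads the full model repository (config, tokenizer, weights) into the
# local Hugging Face cache and returns the path.
local_dir = snapshot_download("google/gemma-3-4b-it")
print("weights downloaded to:", local_dir)
```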
Key points: Gemma-3 is Google's latest open-source multimodal model, available from 1 billion to 27 billion parameters, with compute requirements reduced to roughly one-tenth of comparable models. Its innovative architecture handles long contexts and multimodal data efficiently, supporting text and images in the same prompt. Gemma-3 covers 140 languages and, after pre-training and post-training optimization, performs excellently across a wide range of tasks, demonstrating strong all-round capability.