Microsoft Research, in collaboration with researchers from several universities, recently released a multimodal AI model called "Magma". The model is designed to perform complex tasks in both digital and physical environments by integrating multiple data types, including images, text, and video. As the technology matures, multimodal AI agents are being applied ever more widely in robotics, virtual assistants, and user-interface automation.
Previous AI systems have typically focused on a single domain, such as vision-language understanding or robotic manipulation, and it has proven difficult to combine both capabilities in a unified model. Although many existing models perform well within their specific domains, they generalize poorly across application scenarios. For example, Pix2Act and WebGUM excel at UI navigation, while OpenVLA and RT-2 are better suited to robotic manipulation; however, these models usually have to be trained separately and struggle to cross the boundary between digital and physical environments.
"Magma" was launched precisely to overcome these limitations. It integrates multimodal understanding, action grounding, and planning through a powerful training approach, enabling AI agents to operate seamlessly across a variety of environments. Magma's training dataset contains 39 million samples spanning images, videos, and robot motion trajectories. The model also adopts two innovative techniques: Set-of-Mark (SoM) and Trace-of-Mark (ToM). The former lets the model mark actionable visual objects in a UI environment, while the latter lets it track the movement of objects over time, improving its ability to plan future actions.
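To make the Set-of-Mark idea concrete, the sketch below overlays numbered marks on candidate actionable regions of a screenshot so that a vision-language model can refer to them by index (e.g. "click mark 2"). The bounding boxes, function name, and drawing details are illustrative assumptions only, not Magma's actual implementation.

```python
# Minimal illustrative sketch of the Set-of-Mark (SoM) idea: overlay numbered
# marks on candidate actionable regions so a model can refer to them by index.
# Boxes, names, and styling are hypothetical, not Magma's published code.
from PIL import Image, ImageDraw

def overlay_set_of_marks(image, boxes):
    """Draw a numbered mark on each candidate box and return the mark mapping."""
    annotated = image.copy()
    draw = ImageDraw.Draw(annotated)
    marks = {}
    for idx, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)
        draw.text((x0 + 2, y0 + 2), str(idx), fill="red")
        marks[idx] = (x0, y0, x1, y1)  # the agent can now output "click mark 3"
    return annotated, marks

# Usage: a blank screenshot with two hypothetical UI elements.
screenshot = Image.new("RGB", (320, 240), "white")
annotated, marks = overlay_set_of_marks(screenshot, [(20, 30, 120, 60), (20, 90, 120, 120)])
print(marks)  # {1: (20, 30, 120, 60), 2: (20, 90, 120, 120)}
```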
"Magma" adopts advanced deep learning architecture and large-scale pre-training techniques to optimize its performance in multiple fields. The model uses the ConvNeXt-XXL visual backbone to process images and videos, and the LLaMA-3-8B language model is responsible for processing text input. This architecture enables "Magma" to efficiently integrate vision, language and action execution. After comprehensive training, the model has achieved excellent results on multiple tasks, showing strong multimodal understanding and spatial reasoning capabilities.
Project portal: https://microsoft.github.io/Magma/
Key points:
Magma is trained on 39 million multimodal samples and has strong multimodal learning capabilities.
The model successfully integrates vision, language and action, overcoming the limitations of existing AI models.
Magma performs well on several benchmarks, demonstrating strong generalization as well as excellent decision-making and execution capabilities.