On March 11, the Tongyi Laboratory team announced the open-source release of R1-Omni, a model that brings a new breakthrough to the development of omni-modal models. R1-Omni combines Reinforcement Learning with Verifiable Reward (RLVR) to improve reasoning ability and generalization in multimodal emotion recognition tasks. This innovation not only injects new vitality into the field of artificial intelligence, but also provides important technical support for future multimodal research.
The training process of R1-Omni is divided into two key stages. In the cold-start stage, the team fine-tuned the model on a combined dataset of 580 videos, drawn mainly from the Explainable Multimodal Emotion Reasoning (EMER) dataset and the HumanOmni dataset. The purpose of this stage is to establish the model's basic reasoning ability and ensure it already has some multimodal emotion recognition capability before entering the RLVR stage, keeping subsequent training stable and efficient. Through this stage, the model learns to understand and process multimodal data at a basic level, laying a solid foundation for later optimization.
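To make the cold-start stage more concrete, the sketch below shows what assembling one supervised fine-tuning example might look like, pairing a reasoning trace with a final emotion label in a DeepSeek-R1-style tagged format. The tag layout, prompt wording, field names, and the build_sft_example helper are illustrative assumptions, not the released R1-Omni code.

```python
# Minimal sketch of assembling one cold-start SFT example.
# The <think>/<answer> tag format, prompt text, and all field/helper names
# are illustrative assumptions, not taken from the released R1-Omni code.

from dataclasses import dataclass

@dataclass
class EmotionSample:
    video_path: str   # path to the video clip (frames + audio)
    reasoning: str    # human-written explanation, EMER-style annotation
    emotion: str      # ground-truth emotion label, e.g. "happy"

PROMPT = (
    "As an emotion recognition expert, watch the video and listen to the "
    "audio, then identify the most obvious emotion the person shows."
)

def build_sft_example(sample: EmotionSample) -> dict:
    """Pack one sample into a prompt/target pair for supervised fine-tuning."""
    target = f"<think>{sample.reasoning}</think>\n<answer>{sample.emotion}</answer>"
    return {"video": sample.video_path, "prompt": PROMPT, "target": target}

if __name__ == "__main__":
    demo = EmotionSample(
        video_path="clips/demo_0001.mp4",
        reasoning="The speaker smiles while her voice rises in pitch, "
                  "suggesting a positive, excited state.",
        emotion="happy",
    )
    print(build_sft_example(demo))
```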

Subsequently, in the RLVR stage, the model is further optimized through reinforcement learning with a verifiable reward mechanism. The key to this stage lies in the design of the policy model and the reward function. The policy model processes multimodal input composed of video frames and audio streams and generates candidate responses with detailed reasoning processes, showing how the model integrates visual and auditory information to arrive at its predictions. The reward function, inspired by DeepSeek R1, consists of two parts, an accuracy reward and a format reward, which together form the final reward. This design not only encourages the model to generate correct predictions, but also ensures that the output is well structured and conforms to the preset format, thereby improving overall performance.
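A minimal sketch of how such a combined reward could be computed is shown below: the format reward checks that the response follows a think-then-answer structure, and the accuracy reward checks the predicted label against the ground truth. The exact tags, weighting, and function names here are assumptions for illustration rather than the exact R1-Omni implementation.

```python
import re

# Expected response layout: a reasoning trace followed by a final label.
# The tags and the unweighted sum below are illustrative assumptions.
RESPONSE_PATTERN = re.compile(
    r"^<think>.+?</think>\s*<answer>(.+?)</answer>\s*$", re.DOTALL
)

def format_reward(response: str) -> float:
    """1.0 if the response matches the required structure, else 0.0."""
    return 1.0 if RESPONSE_PATTERN.match(response.strip()) else 0.0

def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the predicted emotion label equals the ground-truth label."""
    match = RESPONSE_PATTERN.match(response.strip())
    if not match:
        return 0.0
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0

def total_reward(response: str, ground_truth: str) -> float:
    """Final reward = accuracy reward + format reward."""
    return accuracy_reward(response, ground_truth) + format_reward(response)

if __name__ == "__main__":
    good = "<think>The raised voice and frown suggest anger.</think>\n<answer>angry</answer>"
    bad = "angry"
    print(total_reward(good, "angry"))  # 2.0: correct label, correct format
    print(total_reward(bad, "angry"))   # 0.0: no reasoning trace, wrong format
```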
Experimental results show that on the in-distribution test sets DFEW and MAFW, R1-Omni achieves an average improvement of more than 35% over the original baseline model and more than 10% over the supervised fine-tuning (SFT) model in unweighted average recall (UAR). On the out-of-distribution test set RAVDESS, both its weighted average recall (WAR) and UAR improve by more than 13%, demonstrating excellent generalization. In addition, R1-Omni offers a notable transparency advantage: with the RLVR method, the contributions of the audio and video information become clearer and more visible, and the model can explicitly show which modality played the key role in a specific emotion judgment, providing an important reference for understanding the model's decision-making process and for future research.
Paper:
https://arxiv.org/abs/2503.05379
GitHub:
https://github.com/HumanMLLM/R1-Omni
Model:
https://www.modelscope.cn/models/iic/R1-Omni-0.5B