Xiaomi recently announced on its official Weibo account that its large-model team has made breakthrough progress in audio reasoning. Inspired by DeepSeek-R1, the team applied reinforcement learning algorithms to multimodal audio understanding tasks for the first time, reaching a state-of-the-art (SOTA) accuracy of 64.5% and topping the internationally recognized MMAU audio understanding benchmark. The team has also decided to open-source the related work to promote further research in academia and industry.

The MMAU (Massive Multi-Task Audio Understanding and Reasoning) benchmark is an important yardstick for measuring audio reasoning ability. It contains 10,000 samples of speech, environmental sounds, and music, designed to comprehensively test a model across multiple skills. According to the published results, human experts reach 82.23% accuracy on this benchmark, while the previously best-performing model, OpenAI's GPT-4o, reached 57.3%, followed by Google DeepMind's Gemini 2.0 Flash at 55.6%.
During their research, the Xiaomi team first fine-tuned on the AVQA dataset released by Tsinghua University, reaching 51.8% accuracy. The real breakthrough came when the team applied DeepSeek-R1's Group Relative Policy Optimization (GRPO) algorithm to the Qwen2-Audio-7B model. Using only 38,000 training samples from AVQA, the team achieved 64.5% accuracy, surpassing existing commercial models.
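The core idea of GRPO is to estimate advantages without a separate value network: several answers are sampled for the same question, and each answer's reward is normalized against the others in its group. The sketch below illustrates only that group-relative normalization step; it is a simplified assumption of the general algorithm, not the team's actual training code.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage estimation as used in GRPO:
    normalize each sampled answer's reward by the mean and standard
    deviation of all answers drawn for the same question.
    Simplified sketch, not the released implementation."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All answers scored the same; no learning signal for this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: four sampled answers to one audio question,
# reward 1.0 if the answer was correct, 0.0 otherwise.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # [1.0, -1.0, -1.0, 1.0]
```

Correct answers receive positive advantage and incorrect ones negative, pushing the policy toward the better samples within each group without training a critic.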
The team also found that when the model was required to output an explicit reasoning process during training, accuracy actually dropped to 61.1%. This suggests that explicit chain-of-thought output may not be conducive to training, and that the real-time feedback mechanism of reinforcement learning is more helpful for steering the model toward the distribution of high-quality answers. Despite the significant accuracy gains, a gap remains relative to human experts.
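This finding implies a reward that scores only the final answer, ignoring any intermediate reasoning. A minimal rule-based reward for a multiple-choice audio question might look like the sketch below; the function name and answer format are illustrative assumptions, not Xiaomi's actual reward design.

```python
def answer_only_reward(model_output: str, correct_choice: str) -> float:
    """Score only the model's final answer, not its reasoning.
    Assumes the final answer appears on the last non-empty line;
    this format is a hypothetical convention for illustration."""
    lines = [l.strip() for l in model_output.splitlines() if l.strip()]
    if not lines:
        return 0.0
    final_answer = lines[-1].lower()
    return 1.0 if correct_choice.lower() in final_answer else 0.0

# Any reasoning before the final line contributes nothing to the reward.
print(answer_only_reward("It has a rhythmic bark.\nAnswer: dog", "dog"))  # 1.0
print(answer_only_reward("Answer: cat", "dog"))  # 0.0
```

Because the reasoning text earns no reward, the policy is optimized purely on answer correctness, consistent with the reported observation that explicit chains of thought did not help.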
The experimental results from Xiaomi's large-model team not only demonstrate the distinct advantages of reinforcement learning for audio reasoning but also offer new directions for future research. To promote further collaboration between academia and industry, the team has open-sourced the training code, model parameters, and technical report. This will undoubtedly accelerate the development of audio reasoning technology and provide valuable resources for researchers in the field.
Training code: https://github.com/xiaomi-research/r1-aqa
Model parameters: https://huggingface.co/mispeech/r1-aqa
Technical report: https://arxiv.org/abs/2503.11197
Interaction Demo: https://120.48.108.147:7860/
Key points:
Xiaomi's large-model team made a breakthrough in audio reasoning using reinforcement learning, reaching 64.5% accuracy.
The MMAU benchmark is an important yardstick for audio reasoning ability; human experts currently achieve 82.23% accuracy on it.
The results show that the real-time feedback mechanism of reinforcement learning is more effective for model training than explicit chain-of-thought output, though further in-depth research is still needed.