In the field of artificial intelligence, the launch of DeepSeek-R1 marks a major breakthrough. Beyond demonstrating the rapid pace of the AI industry, it opens up new possibilities for future AI applications through its Multi-head Latent Attention (MLA) architecture. By compressing keys and values into a low-rank latent representation, MLA sharply reduces training and inference costs, bringing them down to roughly one-tenth of those of comparably performing large models. Building on this, Ji Tao, a postdoctoral researcher in the NLP Laboratory at Fudan University, and his team developed the MHA2MLA framework, whose goal is to let any pretrained large language model migrate quickly to the MLA architecture without training from scratch.
Most mainstream large models today are built on the standard multi-head attention mechanism (MHA) and its variants, which carry a significant inference-cost disadvantage compared with MLA. To address this, the research team proposed the MHA2MLA framework, which migrates MHA/GQA architectures to MLA through two key steps: partial-RoPE retention and a joint low-rank approximation of the key-value representation.
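To make the first of these two steps concrete, the sketch below shows one plausible form of partial RoPE: rotary position encoding is applied only to a small slice of each head's query/key dimensions, while the remaining dimensions carry no positional signal and can later be folded into an MLA-style latent. The slice choice, shapes, and frequency table here are illustrative assumptions, not the team's exact selection strategy.

```python
# Hedged sketch of partial RoPE: rotate only the first `rope_dims` channels of q/k.
import torch

def rotate_half(x):
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def partial_rope(q, k, cos, sin, rope_dims):
    """Apply RoPE to the first `rope_dims` channels of q/k; leave the rest untouched."""
    q_rope, q_pass = q[..., :rope_dims], q[..., rope_dims:]
    k_rope, k_pass = k[..., :rope_dims], k[..., rope_dims:]
    q_rope = q_rope * cos + rotate_half(q_rope) * sin
    k_rope = k_rope * cos + rotate_half(k_rope) * sin
    return torch.cat([q_rope, q_pass], dim=-1), torch.cat([k_rope, k_pass], dim=-1)

# Toy example: batch=1, heads=8, seq=16, head_dim=64, keep RoPE on 16 of 64 dims.
B, H, T, D, R = 1, 8, 16, 64, 16
q, k = torch.randn(B, H, T, D), torch.randn(B, H, T, D)
inv_freq = 1.0 / 10000 ** (torch.arange(R // 2).float() / (R // 2))
angles = torch.outer(torch.arange(T).float(), inv_freq)           # (T, R/2)
cos = torch.cat([angles.cos(), angles.cos()], dim=-1)[None, None]  # (1, 1, T, R)
sin = torch.cat([angles.sin(), angles.sin()], dim=-1)[None, None]
q_out, k_out = partial_rope(q, k, cos, sin, R)
print(q_out.shape, k_out.shape)  # both torch.Size([1, 8, 16, 64])
```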

In implementing MHA2MLA, the team first used a partial-RoPE fine-tuning strategy to decouple positional encoding from the full head dimension, retaining only a small number of position-dependent dimensions and thereby resolving the conflict between MLA and RoPE. Next, a low-rank approximation of the key-value vectors is computed via singular value decomposition (SVD), preserving as much pretrained knowledge as possible while drastically shrinking the cache. Experiments show that fine-tuning on only 0.3% to 0.6% of the pretraining data is enough to largely recover the performance lost during migration. This result demonstrates the efficiency of the MHA2MLA framework and points to new directions for future AI research.
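The sketch below illustrates the SVD step under simple assumptions: the (non-RoPE) key and value projection weights are stacked and factored with a truncated SVD, yielding a shared down-projection into a small latent, which is what gets cached, plus up-projections that reconstruct keys and values on the fly. The matrix names, sizes, and the joint-stacking choice are illustrative; the exact factorization used in MHA2MLA may differ in detail.

```python
# Hedged sketch: truncated SVD of the stacked K/V projections gives an MLA-style latent.
import torch

d_model, d_kv, r = 512, 512, 64          # hidden size, per-layer K/V width, latent rank (illustrative)
W_k = torch.randn(d_kv, d_model)         # stand-in for a pretrained key projection
W_v = torch.randn(d_kv, d_model)         # stand-in for a pretrained value projection

W_kv = torch.cat([W_k, W_v], dim=0)      # (2*d_kv, d_model): joint key-value map
U, S, Vh = torch.linalg.svd(W_kv, full_matrices=False)

W_down = Vh[:r]                          # (r, d_model): hidden state -> latent (cached per token)
W_up = U[:, :r] * S[:r]                  # (2*d_kv, r): latent -> reconstructed [K; V]

x = torch.randn(10, d_model)             # hidden states for 10 tokens
latent = x @ W_down.T                    # (10, r): the only thing the KV cache needs to store
k_hat, v_hat = (latent @ W_up.T).split(d_kv, dim=-1)

# Reconstruction error shrinks as r grows; full rank recovers W_k/W_v exactly.
print(latent.shape, k_hat.shape, v_hat.shape)
```

At inference time only the rank-r latent is kept per token, so the cache shrinks from 2*d_kv to r values per layer and token, which is where the memory savings described above come from.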
When combined with other efficient inference techniques such as 4-bit KV cache quantization, the KV cache of the Llama2-7B model shrinks by 92.19% with a performance loss of only 0.5%. This demonstrates the framework's strong compatibility with compression techniques while preserving the model's inference ability and long-context processing, offering a practical path toward deploying resource-efficient large language models.
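As a rough illustration of how the two savings compose, the sketch below applies plain symmetric round-to-nearest 4-bit quantization to the cached latent vectors; this is a generic scheme, not necessarily the specific 4-bit method evaluated in the experiments. Relative to float16 storage, 4-bit codes alone cut the cache by 4x (a 75% reduction), and the remaining savings in the 92.19% figure come from the low-rank latent itself.

```python
# Hedged sketch: symmetric 4-bit quantization of cached latent vectors.
import torch

def quantize_4bit(x, eps=1e-8):
    """Per-token symmetric 4-bit quantization along the last dim; returns int codes and scales."""
    scale = x.abs().amax(dim=-1, keepdim=True).clamp_min(eps) / 7.0  # int4 range is [-8, 7]
    q = torch.clamp(torch.round(x / scale), -8, 7).to(torch.int8)    # codes held in an int8 container
    return q, scale

def dequantize_4bit(q, scale):
    return q.float() * scale

latent = torch.randn(16, 64)             # 16 cached tokens with a rank-64 latent (see sketch above)
q, scale = quantize_4bit(latent)
approx = dequantize_4bit(q, scale)
print((latent - approx).abs().mean())    # small reconstruction error
```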
However, the research team also notes that, constrained by hardware, the experiments have not yet covered models such as Llama3 that require fine-tuning at 128K context length. Future work will extend the approach to more model architectures and combine it with parameter-efficient fine-tuning strategies to further reduce the number of parameters updated during migration, opening up more possibilities for AI applications and pushing the technology forward.