On the first day of its Open Source Week, DeepSeek officially released its latest engineering achievement: FlashMLA, a Multi-head Latent Attention (MLA) decoding kernel designed specifically for NVIDIA Hopper architecture GPUs. The kernel is optimized for variable-length sequence serving, significantly improving the inference performance of large models and marking a notable advance for deep learning inference.

FlashMLA's core technical features include full support for BF16 precision and a paged KV cache with a block size of 64, enabling fine-grained memory management. In terms of performance, on CUDA 12.6 with an H800 SXM5 GPU, FlashMLA reaches up to 3000 GB/s in memory-bound scenarios and up to 580 TFLOPS in compute-bound scenarios.
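To make the paged KV cache concrete, the sketch below shows how a block table translates a token's logical position into a physical slot in a pool of 64-token pages. This is a minimal illustration of the paging idea only; the names (kv_cache, block_table, gather_kv) and shapes are hypothetical and do not reflect FlashMLA's internal layout.

```python
import torch

BLOCK_SIZE = 64  # FlashMLA's paging granularity: 64 tokens per block

# Hypothetical physical pool of KV blocks; each block stores 64 tokens.
num_blocks, num_heads, head_dim = 1024, 1, 576
kv_cache = torch.zeros(num_blocks, BLOCK_SIZE, num_heads, head_dim)

# Per-sequence block table: logical block index -> physical block id.
# Here one sequence spans three non-contiguous physical blocks.
block_table = torch.tensor([[3, 17, 42]])

def gather_kv(seq_id: int, token_pos: int) -> torch.Tensor:
    """Look up the cached KV entry for a token position in a sequence."""
    logical_block = token_pos // BLOCK_SIZE   # which 64-token page
    offset = token_pos % BLOCK_SIZE           # position within that page
    physical_block = block_table[seq_id, logical_block].item()
    return kv_cache[physical_block, offset]

# Token 130 falls in logical block 2 (physical block 42), offset 2.
print(gather_kv(0, 130).shape)  # torch.Size([1, 576])
```

Because sequences reference fixed-size pages through an indirection table rather than one contiguous buffer, variable-length requests can grow without reallocation, which is what makes the block-64 design well suited to the decoding workloads FlashMLA targets.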
The project has been validated in production environments and demonstrates excellent stability. The development team notes that FlashMLA's design draws on the experience of projects such as FlashAttention 2/3 and CUTLASS, building on them with its own optimizations to further improve performance in complex serving scenarios.
Developers can deploy FlashMLA quickly: running "python setup.py install" completes the installation, after which the benchmark script "python tests/test_flash_mla.py" can be run to verify its performance.
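After installation, decoding follows the usage pattern shown in the project README: compute tile-scheduling metadata once per decoding step, then call the kernel against the paged KV cache. The snippet below is a sketch adapted from that pattern; the tensor shapes and sequence lengths are illustrative placeholders, and a Hopper GPU with the package built is assumed.

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

# Illustrative shapes: batch, query length (decoding step), query heads,
# KV heads, head dim, and value head dim.
b, s_q, h_q, h_kv, d, dv = 4, 1, 128, 1, 576, 512
block_size, max_seqlen = 64, 1024
max_blocks = max_seqlen // block_size

# Current cached length of each sequence in the batch.
cache_seqlens = torch.full((b,), max_seqlen, dtype=torch.int32, device="cuda")

# Scheduling metadata, computed once per decoding step.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv
)

# Queries, paged KV cache, and the block table mapping pages to blocks.
q = torch.randn(b, s_q, h_q, d, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(
    b * max_blocks, block_size, h_kv, d, dtype=torch.bfloat16, device="cuda"
)
block_table = torch.arange(
    b * max_blocks, dtype=torch.int32, device="cuda"
).view(b, max_blocks)

# Attention over the paged cache; returns output and log-sum-exp values.
o, lse = flash_mla_with_kvcache(
    q, kvcache, block_table, cache_seqlens, dv,
    tile_scheduler_metadata, num_splits, causal=True,
)
```

In a real serving loop the kernel call is repeated per transformer layer with that layer's queries and cache, while the metadata from get_mla_metadata is shared across layers within a step.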
Open source address: https://github.com/deepseek-ai/FlashMLA