In the field of artificial intelligence, Moore Threads has once again taken the lead in technological innovation by open-sourcing two major AI frameworks: MT-MegatronLM and MT-TransformerEngine. This move not only injects new vitality into domestic computing infrastructure, but also provides powerful tooling for AI developers worldwide. By deeply integrating an FP8 mixed-precision training strategy with high-performance operator libraries, the two frameworks enable hybrid parallel training and inference on domestic full-function GPUs, significantly improving the efficiency and stability of large-scale model training.
The MT-MegatronLM framework is designed for full-function GPUs and supports efficient training of dense models, multimodal models, and MoE (Mixture of Experts) models, covering the diverse training needs of today's AI field. MT-TransformerEngine focuses on training and inference optimization for Transformer models; through operator fusion, parallel acceleration strategies, and other techniques, it unlocks the high-density compute potential of Moore Threads GPUs and significantly improves the efficiency of memory-bound operators.
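To make the FP8 mixed-precision idea concrete, here is a minimal conceptual sketch in plain PyTorch. It is not MT-TransformerEngine's actual API: it simply quantizes tensors to the E4M3 format with a per-tensor scale taken from the absolute maximum and dequantizes before the matmul, whereas production frameworks keep running scale statistics and execute the GEMM directly in FP8 on the hardware.

```python
import torch  # requires PyTorch >= 2.1 for the float8 dtypes

# Conceptual FP8 (E4M3) quantization: scale each tensor so its abs-max maps
# near the top of the representable range, cast to float8, then dequantize
# for the matmul. Real FP8 training runs the GEMM in FP8 on the GPU instead.
FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format

def quantize_fp8(x: torch.Tensor):
    scale = FP8_E4M3_MAX / x.abs().max().clamp(min=1e-12)
    return (x * scale).to(torch.float8_e4m3fn), scale

def dequantize(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return x_fp8.to(torch.float32) / scale

a, w = torch.randn(128, 256), torch.randn(256, 512)
a_fp8, a_scale = quantize_fp8(a)
w_fp8, w_scale = quantize_fp8(w)

out = dequantize(a_fp8, a_scale) @ dequantize(w_fp8, w_scale)
print("relative error:", ((out - a @ w).norm() / (a @ w).norm()).item())
```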

The technological breakthroughs of the two frameworks lie in the deep co-design of hardware adaptation and algorithmic innovation. First, they support mixed parallel training for multiple model types and can flexibly handle the complex computation patterns of different architectures. Second, combined with the FP8 mixed-precision training natively supported by Moore Threads GPUs, they effectively improve training efficiency. Third, through deep integration with the high-performance operator library muDNN and the communication library MCCL, they systematically optimize compute-intensive tasks and multi-card communication overhead. In addition, the open-source Simumax library can automatically search parallel strategies for different models and acceleration environments to maximize parallel training performance (sketched below). The frameworks' built-in rewind exception-recovery mechanism can automatically roll back to the most recent stable checkpoint and resume training, greatly improving the stability of large-scale runs. Finally, both frameworks are compatible with the mainstream GPU ecosystem, ensuring smooth migration of existing code while giving developers an underlying foundation on which to build their own AI technology stacks.
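To illustrate what an automatic parallel-strategy search involves, the following sketch enumerates (tensor, pipeline, data) parallel degrees that divide the cluster and picks the one with the lowest estimated step time. The cost model is a toy placeholder and none of the names come from Simumax's actual interface; a real search would be driven by measured operator and communication profiles.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class Strategy:
    tp: int  # tensor-parallel size
    pp: int  # pipeline-parallel size
    dp: int  # data-parallel size

def estimated_step_time(s: Strategy, num_layers: int = 32) -> float:
    # Toy cost model: ideal compute scaling plus rough penalties for
    # tensor-parallel all-reduces, pipeline bubbles, and gradient all-reduces.
    compute = num_layers / (s.tp * s.pp * s.dp)
    tp_comm = 0.2 * (s.tp - 1)
    pp_bubble = (s.pp - 1) / (s.pp * 8)
    dp_comm = 0.05 * (s.dp - 1)
    return compute * (1 + pp_bubble) + tp_comm + dp_comm

def search(world_size: int) -> Strategy:
    candidates = [
        Strategy(tp, pp, world_size // (tp * pp))
        for tp, pp in product([1, 2, 4, 8], [1, 2, 4, 8])
        if world_size % (tp * pp) == 0
    ]
    return min(candidates, key=estimated_step_time)

print(search(world_size=64))  # e.g. Strategy(tp=..., pp=..., dp=...)
```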

In practical applications, the two frameworks deliver impressive results. On a full-function GPU cluster, the Llama3 8B training task can use FP8 to reach utilization above 90% while keeping the loss nearly lossless, a 28% speedup over the original training. In addition, Moore Threads has deeply integrated and open-sourced efficient support for DeepSeek's parallel algorithm DualPipe; with MT-DualPipe fully connected to the MT-Megatron and MT-TransformerEngine frameworks, a complete reproduction of the DeepSeek-V3 training pipeline has been achieved, supporting MLA, MTP, and a variety of expert-balancing strategies. Through multiple Transformer operator-fusion techniques (illustrated below), the frameworks significantly improve memory-bandwidth utilization, effectively alleviate memory-bound bottlenecks, and further unlock the hardware potential of domestic GPUs.
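The gain from operator fusion on memory-bound steps can be shown with a small, hedged example: the unfused bias-plus-GELU launches two kernels and materializes an intermediate tensor in memory, while a fused version traverses memory only once. Here torch.compile stands in for the frameworks' hand-written fused kernels, which the article attributes to muDNN and the Transformer operator-fusion work.

```python
import torch
import torch.nn.functional as F

def bias_gelu_unfused(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    y = x + bias          # kernel 1: writes the intermediate y to memory
    return F.gelu(y)      # kernel 2: reads y back, writes the output

# torch.compile fuses the elementwise chain into a single kernel on supported
# backends, cutting the extra round trip through memory.
bias_gelu_fused = torch.compile(bias_gelu_unfused)

x, bias = torch.randn(4096, 4096), torch.randn(4096)
diff = (bias_gelu_unfused(x, bias) - bias_gelu_fused(x, bias)).abs().max()
print("max abs difference:", diff.item())  # should be ~0
```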
Moore Threads said it will continue to optimize the two frameworks and plans to introduce a series of new features: a DualPipe/ZeroBubble parallel strategy to further reduce the bubble rate and improve parallel training efficiency; a variety of original FP8 optimization strategies to improve training performance and stability; asynchronous checkpointing to improve fault tolerance and efficiency during training (sketched below); an optimized recomputation strategy to reduce compute and memory overhead and speed up training; original fault-tolerant training algorithms to further strengthen robustness during training; and integration of the Moore Threads FlashMLA and DeepGemm libraries to further release the compute power and FP8 performance of Moore Threads GPUs, comprehensively improving performance and efficiency.
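As a rough sketch of what asynchronous checkpointing means (the function and file names below are illustrative, not the frameworks' API): the training thread takes a CPU snapshot of the model state, and a background thread writes it to disk so subsequent training steps are not blocked by I/O.

```python
import threading
import torch

def async_checkpoint(model: torch.nn.Module, path: str) -> threading.Thread:
    # Snapshot to CPU synchronously so later parameter updates cannot race
    # with the background write.
    cpu_state = {k: v.detach().cpu().clone() for k, v in model.state_dict().items()}

    def _write() -> None:
        torch.save(cpu_state, path)  # slow disk I/O happens off the training thread

    writer = threading.Thread(target=_write, daemon=True)
    writer.start()
    return writer  # call .join() before exiting or overwriting the same path

model = torch.nn.Linear(1024, 1024)
writer = async_checkpoint(model, "ckpt_step_100.pt")
# ... training would continue here while the checkpoint is written ...
writer.join()
```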
This series of technological breakthroughs and open-source moves not only demonstrates Moore Threads' strength in AI computing, but also opens up new possibilities for the development of domestic AI infrastructure. We look forward to the further breakthroughs it brings to AI model training.