Recently, the Ling team at Ant Group released a compelling technical paper on the arXiv preprint platform, titled "Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs." The paper details two new large language models the team has developed, Ling-Lite and Ling-Plus. Both models incorporate several innovative techniques that allow them to be trained efficiently on lower-performance hardware, significantly reducing costs.
The lightweight Bailing model, Ling-Lite, has 16.8 billion parameters, of which 2.75 billion are activated. The enhanced model, Ling-Plus, has up to 290 billion parameters and 28.8 billion activated parameters. Both models reach industry-leading performance, the enhanced version in particular: the roughly 300-billion-parameter MoE model, trained on lower-performance devices with domestic GPUs, performs on par with models trained on high-end Nvidia chips.

Training MoE models typically relies on expensive high-performance GPUs such as Nvidia's H100 and H800, which is not only costly but also constrained by chip shortages, limiting their use in resource-constrained environments. In response, the Ant Group Ling team set a new goal: scaling the model "without advanced GPUs," breaking through resource and budget limits. Their innovative training strategies include dynamic parameter allocation, mixed-precision scheduling, and an upgraded mechanism for handling training exceptions. Together, these strategies shorten the response time to interruptions, streamline the model evaluation process, and compress validation cycles by more than 50%.
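For illustration only, the following Python sketch shows how two of these ideas, mixed-precision computation and checkpoint-based recovery from training exceptions, are commonly implemented. The model, training loop, and checkpoint path here are placeholders and do not reflect the Ling team's actual code.

```python
# Illustrative sketch only: mixed-precision training with checkpoint-based
# recovery from runtime faults. The model, data, and checkpoint path are
# placeholders, not the Ling team's actual implementation.
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(1024, 1024).to(device)            # stand-in for a real MoE layer
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
CKPT_PATH = "ling_demo_checkpoint.pt"               # hypothetical checkpoint file

def save_checkpoint(step: int) -> None:
    torch.save({"step": step,
                "model": model.state_dict(),
                "optim": optimizer.state_dict()}, CKPT_PATH)

def load_checkpoint() -> int:
    """Restore the latest checkpoint; return the step to resume from."""
    try:
        state = torch.load(CKPT_PATH, map_location=device)
    except FileNotFoundError:
        return 0                                     # no checkpoint yet, start fresh
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optim"])
    return state["step"]

start_step = load_checkpoint()
for step in range(start_step, start_step + 200):
    x = torch.randn(32, 1024, device=device)         # dummy training batch
    try:
        # Mixed precision: compute in bfloat16 while keeping fp32 master weights.
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            loss = model(x).pow(2).mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad(set_to_none=True)
    except RuntimeError:
        # On a hardware fault or out-of-memory error, roll back to the last
        # checkpoint instead of aborting, keeping the interruption short.
        optimizer.zero_grad(set_to_none=True)
        load_checkpoint()
        continue
    if step % 50 == 0:
        save_checkpoint(step)
```

The key point of the design is that a failure only costs the work done since the last checkpoint, which is what makes frequent, cheap recovery possible on less reliable hardware.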
In their experiments, the Ling team pre-trained Ling-Plus on 9 trillion tokens. The results show that training on 1 trillion tokens with a high-performance hardware configuration costs about 6.35 million yuan, while with Ant's optimization methods the same training on lower-spec hardware costs about 5.08 million yuan, a saving of nearly 20%. At the same time, the model's performance is comparable to Alibaba's Tongyi Qwen2.5-72B-Instruct and to DeepSeek-V2.5-1210-Chat.
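As a quick sanity check using only the figures quoted above, the roughly 20% saving follows directly from the two reported costs:

```python
# Quick check of the reported saving, using the figures cited above (in million yuan).
high_end_cost = 6.35    # training on 1 trillion tokens with high-performance hardware
optimized_cost = 5.08   # same workload on lower-spec hardware with Ant's optimizations
saving = (high_end_cost - optimized_cost) / high_end_cost
print(f"cost reduction: {saving:.1%}")  # prints: cost reduction: 20.0%
```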
If this technical achievement is widely adopted, it will offer more cost-effective options for domestic large models, reduce dependence on Nvidia chips, and open a new path for the future development of artificial intelligence.