With the rapid development of artificial intelligence technology, the DeepSeek team has launched its new DeepSeek-V3/R1 inference system. The system is designed to advance general artificial intelligence (AGI) by delivering higher throughput and lower latency. To achieve this, DeepSeek adopts Expert Parallelism (EP), which significantly improves GPU computing efficiency and scales up batch size while reducing latency.
The core of DeepSeek-V3/R1 is its extreme sparsity: only 8 of the 256 experts are activated per token in each layer of the model, so a very large overall batch size is required to give each expert enough tokens to process efficiently. The system's architecture disaggregates prefill and decode, applying different degrees of parallelism in each stage.
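The effect of this sparsity on per-expert batch size can be shown with a minimal sketch, assuming a simple top-k router over 256 experts; the routing logic and token counts below are illustrative, not DeepSeek's implementation:

```python
import numpy as np

# Illustrative top-k routing for a sparse MoE layer: 256 experts, 8 active per token.
NUM_EXPERTS, TOP_K = 256, 8

def route_tokens(router_logits: np.ndarray) -> np.ndarray:
    """Return the indices of the top-k experts selected for each token."""
    # router_logits: [num_tokens, NUM_EXPERTS]
    return np.argsort(router_logits, axis=-1)[:, -TOP_K:]

# With a small batch, each expert sees very few tokens on average,
# which is why a very large batch size is needed to keep experts busy.
for num_tokens in (32, 8192):
    expert_ids = route_tokens(np.random.randn(num_tokens, NUM_EXPERTS))
    avg_per_expert = expert_ids.size / NUM_EXPERTS  # num_tokens * TOP_K / NUM_EXPERTS
    print(f"{num_tokens} tokens -> ~{avg_per_expert:.1f} tokens per expert")
```

With 8 of 256 experts active, each expert receives on average only 1/32 of the batch, so the total batch must grow roughly 32x for experts to see the same workload as a dense layer.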
During the prefill phase, the system hides communication cost with a dual-batch overlap strategy: while one micro-batch is being computed, the communication of the other micro-batch runs concurrently and is masked by that computation, improving overall throughput. In the decode phase, where the different execution stages take unequal amounts of time, DeepSeek uses a five-stage pipeline to overlap communication and computation seamlessly.
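As a rough illustration of why dual-batch overlap helps, the toy cost model below compares a serial schedule with one in which one micro-batch's communication is hidden behind the other micro-batch's computation; the per-layer costs and layer count are hypothetical, not measurements from DeepSeek's system:

```python
# Toy cost model for dual-batch overlap (all numbers are assumptions).
COMPUTE_MS, COMM_MS = 4.0, 3.0   # hypothetical per-layer compute and all-to-all cost
NUM_LAYERS = 4
NUM_MICROBATCHES = 2

def sequential_time() -> float:
    # Without overlap: every layer of every micro-batch pays compute + communication serially.
    return NUM_MICROBATCHES * NUM_LAYERS * (COMPUTE_MS + COMM_MS)

def overlapped_time() -> float:
    # With dual-batch overlap, one micro-batch's communication runs during the
    # other micro-batch's compute, so each steady-state step costs max(compute, comm),
    # plus a small fill/drain cost at the start and end of the schedule.
    steady_state = NUM_MICROBATCHES * NUM_LAYERS * max(COMPUTE_MS, COMM_MS)
    return steady_state + min(COMPUTE_MS, COMM_MS)

print(f"sequential: {sequential_time():.1f} ms, overlapped: {overlapped_time():.1f} ms")
```

In this toy setting the overlapped schedule is bounded by the larger of the two costs per step instead of their sum, which is the intuition behind both the prefill dual-batch overlap and the decode-stage pipeline.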
To cope with the load imbalance that large-scale parallelism introduces, the DeepSeek team deploys multiple load balancers. These balance computing and communication load across all GPUs, preventing any single overloaded GPU from becoming a performance bottleneck and ensuring efficient utilization of resources.
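A minimal sketch of the idea behind such balancing, assuming estimated per-expert loads (e.g. recent token counts) and a greedy longest-processing-time placement, is shown below; the heuristic and the numbers are illustrative assumptions, not DeepSeek's actual balancer:

```python
import heapq

def balance_experts(expert_loads: list[float], num_gpus: int) -> list[list[int]]:
    """Greedily assign each expert to the currently least-loaded GPU, heaviest experts first."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]   # (current load, gpu id)
    heapq.heapify(heap)
    placement: list[list[int]] = [[] for _ in range(num_gpus)]
    for expert_id in sorted(range(len(expert_loads)),
                            key=lambda i: expert_loads[i], reverse=True):
        load, gpu = heapq.heappop(heap)
        placement[gpu].append(expert_id)
        heapq.heappush(heap, (load + expert_loads[expert_id], gpu))
    return placement

# Example: 16 experts with skewed loads spread across 4 GPUs.
loads = [10, 9, 8, 8, 5, 5, 4, 4, 3, 3, 2, 2, 1, 1, 1, 1]
for gpu, experts in enumerate(balance_experts(loads, 4)):
    print(f"GPU {gpu}: experts {experts}, total load {sum(loads[e] for e in experts)}")
```

The goal is the same as described above: keep the total load per GPU close to uniform so that no single device stalls the whole batch.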
In terms of service performance, the DeepSeek-V3/R1 inference service runs on H800 GPUs, using matrix multiplication and transmission formats consistent with those used during training. According to the latest statistics, the system processed 608 billion input tokens over the past 24 hours, with a peak occupancy of 278 nodes and a daily average of 226.75 nodes, indicating good overall service performance.
Through efficient architectural design and intelligent load management, the DeepSeek-V3/R1 inference system not only improves the inference performance of artificial intelligence models, but also provides strong infrastructure support for future AGI research and applications.
Project: https://github.com/deepseek-ai/open-infra-index/blob/main/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md
Key points:
The DeepSeek-V3/R1 inference system achieves higher throughput and lower latency through cross-node Expert Parallelism (EP).
A dual-batch overlap strategy and a five-stage pipeline are adopted to improve computing efficiency and hide communication cost.
Multiple load balancers are deployed to ensure efficient resource utilization across GPUs and to avoid performance bottlenecks.