In the field of artificial intelligence, a costly experiment is quietly changing how large language models are trained. The StepFun research team recently released a significant result: by training 3,700 models of different sizes from scratch, consuming nearly one million NVIDIA H800 GPU hours and roughly 100 trillion tokens in total, they uncovered a universal scaling rule they call "Step Law". The discovery offers new guidance for the efficient training of large language models.
This study is not only an exploration of hyperparameter optimization but also a systematic examination of how stable a model's optimal hyperparameters remain across different model shapes, sparsity levels, and data distributions. The results show that Step Law is remarkably robust regardless of the model's architectural design or the language and domain of the training data, which greatly enhances its value as a practical tool.
The 3,700 models trained by the team span different scales, hyperparameter combinations, model shapes, data mixtures, and sparsity levels, covering both MoE and Dense architectures. From these massive experiments they found that the optimal learning rate follows a power law in both model parameter count and data size, while the optimal batch size depends mainly on data size. This finding challenges the industry's conventional understanding of hyperparameter settings.
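To make the power-law form concrete, here is a minimal Python sketch of such an estimator. The functional form (learning rate as a power law in parameter count N and token count D, batch size as a power law in D alone) follows the article; the coefficients and exponents below are illustrative placeholders, not the team's published fit.

```python
# Sketch of a Step-Law-style hyperparameter estimator.
# Functional form per the article: lr* ~ N^a * D^b, bs* ~ D^c.
# Coefficients and exponents below are ILLUSTRATIVE, not the published fit.

def optimal_lr(n_params: float, n_tokens: float,
               coef: float = 1.8, exp_n: float = -0.71, exp_d: float = 0.31) -> float:
    """Predicted optimal peak learning rate for a model with
    n_params parameters trained on n_tokens tokens."""
    return coef * n_params**exp_n * n_tokens**exp_d

def optimal_bs(n_tokens: float,
               coef: float = 0.6, exp_d: float = 0.57) -> float:
    """Predicted optimal batch size (in tokens); depends on data size only."""
    return coef * n_tokens**exp_d

# Example: a 1B-parameter model trained on 100B tokens.
print(f"lr* ~ {optimal_lr(1e9, 100e9):.2e}")
print(f"bs* ~ {optimal_bs(100e9):,.0f} tokens per step")
```

With these placeholder values the example prints a peak learning rate around 2e-3 and a batch size around one million tokens, which is in the plausible range for a model of that size; the point is the shape of the law, not the specific numbers.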

Experimental data show that, with model size and data size fixed, the hyperparameter loss landscape exhibits clear convexity, meaning there is a stable, easy-to-find region of optimal hyperparameters. To verify this, the team constructed a three-dimensional visualization showing how learning rate and batch size affect training loss. The result clearly shows a "valley" shape whose bottom is a relatively flat region, providing a valuable empirical basis for hyperparameter tuning in practice.
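As an illustration, here is a minimal matplotlib sketch of this kind of visualization. The loss surface below is synthetic, a smooth convex bowl over log learning rate and log batch size, standing in for the grid of real training runs the team measured.

```python
# Sketch: visualizing a loss "valley" over (learning rate, batch size).
# The surface here is SYNTHETIC convex data standing in for real runs.
import numpy as np
import matplotlib.pyplot as plt

log_lr = np.linspace(-4.5, -2.0, 50)   # log10 learning rate
log_bs = np.linspace(4.0, 7.0, 50)     # log10 batch size (tokens)
LR, BS = np.meshgrid(log_lr, log_bs)

# Toy convex bowl with a flat bottom near (lr=1e-3, bs=1e6).
loss = 2.5 + 0.15 * (LR + 3.0) ** 2 + 0.08 * (BS - 6.0) ** 2

fig = plt.figure(figsize=(7, 5))
ax = fig.add_subplot(projection="3d")
ax.plot_surface(LR, BS, loss, cmap="viridis", alpha=0.9)
ax.set_xlabel("log10(learning rate)")
ax.set_ylabel("log10(batch size)")
ax.set_zlabel("training loss")
ax.set_title("Convex loss landscape (synthetic illustration)")
plt.show()
```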
To make this discovery benefit the entire AI community, the team developed and released a general-purpose optimal-hyperparameter estimation tool. Its predictions come within 0.09% of the globally optimal hyperparameters found by exhaustive search. This means researchers and engineers no longer need to rely on expensive grid searches; they can obtain near-optimal hyperparameter configurations directly from the tool.
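The 0.09% figure is a relative loss gap. As a hedged sketch of how such a gap could be computed from a grid of measured runs, consider the following; the `runs` data and the predicted setting are hypothetical.

```python
# Sketch: measuring the relative loss gap between a predicted
# hyperparameter setting and an exhaustive grid search.
# The `runs` dict and the predicted setting are HYPOTHETICAL examples.

runs = {  # (learning_rate, batch_size) -> final training loss
    (1e-3, 512): 2.412,
    (2e-3, 512): 2.401,
    (2e-3, 1024): 2.398,   # global optimum in this toy grid
    (4e-3, 1024): 2.405,
}

best_setting = min(runs, key=runs.get)   # exhaustive search
predicted = (2e-3, 512)                  # e.g. from the estimator
gap = (runs[predicted] - runs[best_setting]) / runs[best_setting]
print(f"relative loss gap: {gap:.4%}")   # -> 0.1251% in this toy grid
```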
What is even more impressive is the universality of Step Law. The team verified its scope from three angles: first, no matter how the model shape varies, whether biased toward width, toward depth, or balanced between the two, Step Law accurately predicts the optimal hyperparameter region; second, the rule applies not only to Dense models but also extends well to MoE models of varying sparsity; finally, whether the training data is English-dominant, Chinese-English bilingual, a code-English mix, or code-dominant, Step Law shows remarkable stability.
The research also points to a better learning-rate scheduling strategy. Instead of the traditional decay scheme, which sets the minimum learning rate to one tenth of the maximum, the team proposes a fixed minimum learning rate of 1e-5. This keeps the parameter-update step size reasonable late in training and effectively avoids persistent oscillation of the loss in the convergence stage.
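A minimal sketch of the two variants follows, assuming a cosine decay from a peak learning rate; the cosine shape is an assumption, since the article specifies only the fixed 1e-5 floor versus the max/10 convention.

```python
# Sketch: cosine learning-rate decay with two choices of floor.
# The cosine shape is an ASSUMPTION; the article only contrasts
# a fixed 1e-5 minimum with the traditional max/10 minimum.
import math

def cosine_lr(step: int, total_steps: int, peak_lr: float,
              min_lr: float) -> float:
    """Cosine decay from peak_lr down to min_lr over total_steps."""
    progress = min(step / total_steps, 1.0)
    cos = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cos

peak = 2e-3
for step in (0, 50_000, 99_000):
    traditional = cosine_lr(step, 100_000, peak, min_lr=peak / 10)
    fixed_floor = cosine_lr(step, 100_000, peak, min_lr=1e-5)
    print(f"step {step:>6}: traditional={traditional:.2e}  fixed-floor={fixed_floor:.2e}")
```

Near the end of training the traditional floor keeps updates roughly twenty times larger than the fixed 1e-5 floor in this example, which illustrates why the larger floor can leave the loss oscillating near convergence.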
In addition, the study found that the optimal hyperparameters identified by smoothed training loss closely match those identified by validation loss. This offers a more economical approach to hyperparameter selection: researchers can guide tuning by monitoring the smoothed training loss, without frequently evaluating the model on a validation set.
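A minimal sketch of such smoothing, using an exponential moving average; the averaging scheme is an assumption, as the article does not specify which smoother the team used.

```python
# Sketch: exponential-moving-average smoothing of the training loss.
# EMA is an ASSUMED choice; the article does not specify the smoother.
def ema_smooth(losses, alpha=0.99):
    """Return EMA-smoothed losses; higher alpha means heavier smoothing."""
    smoothed, avg = [], losses[0]
    for loss in losses:
        avg = alpha * avg + (1 - alpha) * loss
        smoothed.append(avg)
    return smoothed

# Hyperparameter settings can then be ranked by their final
# smoothed training loss instead of periodic validation loss.
```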
Despite the remarkable results, the StepFun team acknowledges that this is only a beginning. They plan to open-source the details of the experiments, including the final checkpoints of nearly 4,000 models, so the whole community can perform deeper analysis and theoretical interpretation. Future directions include exploring the convexity of the Loss-BS-LR three-dimensional space, improving the fitting method for optimal hyperparameters, explaining how the optimal region shifts across configurations, and studying training dynamics under different settings in depth.
Follow-up work in the Predictable Scale series may further cover performance prediction for very large models, the scaling properties of Code & Math, and the scaling characteristics of different Attention variants. This line of research can be expected to provide more comprehensive theoretical guidance and practical tools for efficient large-language-model training, pushing AI toward more efficient and controllable development.