Recently, ByteDance's Doubao large-model team and the MAP open-source community jointly released SuperGPQA, a knowledge-reasoning benchmark covering 285 graduate-level disciplines with 26,529 expert-level questions. This new dataset not only covers mainstream disciplines such as mathematics and physics, but also brings long-tail disciplines such as light industry, agriculture, and service science into the evaluation scope for the first time, filling the long-tail knowledge gap left by existing benchmarks.
The launch of SuperGPQA marks an important milestone for the field. The dataset was built over half a year using an expert-LLM collaboration mechanism to screen questions from authoritative sources. Each question offers 9.67 answer options on average, and 42.33% of the questions require mathematical calculation or formal reasoning, giving the benchmark both breadth and depth. Experiments show that the best-performing model, DeepSeek-R1, reaches only 61.82% accuracy, indicating that current large language models still have substantial room for improvement across diverse knowledge domains.
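As a rough illustration, the headline statistics can be re-derived directly from the published dataset. The following is a minimal sketch assuming the Hugging Face `datasets` library and hypothetical column names `options` and `is_calculation`; check the dataset card at the data link below for the actual schema.

```python
# Minimal sketch: re-derive the headline statistics from the released data.
# Column names `options` (list of answer choices) and `is_calculation` (bool)
# are assumptions; consult the dataset card for the real schema.
from datasets import load_dataset

# Repo id taken from the data link in this article; adjust if the hub path differs.
ds = load_dataset("map/SuperGPQA", split="train")

avg_options = sum(len(row["options"]) for row in ds) / len(ds)
calc_share = sum(bool(row["is_calculation"]) for row in ds) / len(ds)

print(f"avg options per question: {avg_options:.2f}")          # reported: 9.67
print(f"calculation/formal reasoning share: {calc_share:.2%}")  # reported: 42.33%
```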
Traditional benchmarks such as MMLU and GPQA cover fewer than 50 disciplines, with long-tail disciplines accounting for less than 5%. Because they rely on single data sources (such as Wikipedia) and unreliable crowdsourced annotation, they struggle to measure a model's reasoning ability in complex scenarios. SuperGPQA improves quality through a three-stage process: expert screening of source questions, standardized transcription, and multi-layer quality inspection (rule-based filtering, LLM checks, and expert review), sketched below. Evaluation results show that instruction tuning significantly improves performance; DeepSeek-V3, for example, scores higher than its base version, but open-source models still lag behind closed-source ones on the hardest questions.
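To make the first quality-inspection layer concrete, here is an illustrative sketch of the kind of rule-based filter it might apply. The specific predicates and thresholds are assumptions for illustration, not the authors' actual rules; the real pipeline layers LLM checks and expert review on top.

```python
# Illustrative first-layer rule filter for the three-stage quality process.
# The concrete rules here are assumptions, not the paper's actual pipeline.
def passes_rule_filter(item: dict) -> bool:
    question = item["question"].strip()
    options = item["options"]
    if len(options) != len(set(options)):  # reject duplicate answer options
        return False
    if item["answer"] not in options:      # gold answer must be a listed option
        return False
    if len(question) < 20:                 # reject trivially short question stems
        return False
    return True

# Hypothetical question record for demonstration.
sample = {
    "question": "Which process variable most strongly affects haze formation in lager brewing?",
    "options": ["Mash temperature", "Wort pH", "Boil duration", "Mash temperature"],
    "answer": "Wort pH",
}
print(passes_rule_filter(sample))  # False: duplicate options
```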
SuperGPQA has already been used to reveal the performance gap between open-source and closed-source models and has become a valuable tool for AI development. Its release not only provides a new evaluation standard for AI research, but also points the way for future model optimization and stronger knowledge-reasoning capabilities.
Paper link: https://arxiv.org/pdf/2502.14739
Data link: https://huggingface.co/datasets/map/SuperGPQA
Code link: https://github.com/SuperGPQA/SuperGPQA