The Shanghai Artificial Intelligence Laboratory recently conducted a unique "college entrance examination". Seven AI models, including GPT-4o, took comprehensive tests in Chinese, mathematics, and English, using the national New Curriculum Standard Paper I and manual marking to ensure fairness and impartiality. The test aims to evaluate how well AI models handle college entrance examination questions and to provide reference data for future AI development. The participating models come from well-known institutions at home and abroad, reflecting the different directions and levels of current AI technology.
In the world of artificial intelligence, the college entrance examination is no longer a stage for humans alone. Recently, the Shanghai Artificial Intelligence Laboratory let us witness AI's academic strength through this unique "college entrance examination": using its OpenCompass evaluation system, it put seven AI models, including GPT-4o, through comprehensive proficiency tests in Chinese, mathematics, and English.

The test used Paper I of the new national curriculum standard. All participating open-source models had been released before the examination took place, so none could have seen the questions in advance, ensuring the fairness of the test. Moreover, the AI "answer sheets" were marked by teachers with college entrance examination marking experience, to stay as close as possible to real marking standards.
The models participating in the evaluation come from a range of backgrounds: the open-source Mixtral-8x22B dialogue model from the French AI startup Mistral, Yi-1.5-34B from 01.AI, GLM-4-9B from Zhipu AI, InternLM2-20B-WQX from the Shanghai Artificial Intelligence Laboratory, and Alibaba's Qwen2 series. GPT-4o, as a closed-source model, participated for reference only.

When the results were announced, Qwen2-72B ranked first with a total score of 303 points, followed by GPT-4o with 296 points and InternLM2-20B-WQX in third with 295.5 points. The models performed well in Chinese and English, with average score rates of 67% and 81% respectively. In mathematics, however, the average score rate across all models was only 36%, showing that AI still has considerable room for improvement in mathematical reasoning.
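The score rates above can be reproduced from the reported totals. As a minimal sketch, assuming the standard full mark of 150 points per subject (450 in total) and a hypothetical `score_rate` helper:

```python
# Full mark per subject on the national paper (assumed: 150 each for
# Chinese, mathematics, and English, i.e. 450 in total).
FULL_MARK_PER_SUBJECT = 150
TOTAL_FULL_MARK = 3 * FULL_MARK_PER_SUBJECT

def score_rate(score: float, full_mark: float) -> float:
    """Return a score as a percentage of the full mark, to one decimal."""
    return round(100 * score / full_mark, 1)

# Reported total scores out of 450 for the top three models.
totals = {"Qwen2-72B": 303, "GPT-4o": 296, "InternLM2-20B-WQX": 295.5}

for model, total in totals.items():
    print(f"{model}: {score_rate(total, TOTAL_FULL_MARK)}%")
# Qwen2-72B's 303/450 works out to an overall score rate of about 67.3%.
```

This also makes the subject-level gap concrete: a 36% score rate in mathematics corresponds to roughly 54 of 150 points under the assumed full mark, far below the Chinese and English averages.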
The marking teachers analyzed the AI models' answer sheets in detail. In Chinese, the models were generally good at modern-text reading comprehension but somewhat weaker in classical Chinese and essay writing. In mathematics, although the models could recall formulas well, they struggled to apply them flexibly during problem solving. In English, overall performance was good, but some models scored lower on certain question types.
This "large-model college entrance examination" not only shows the potential of AI in academic tasks but also reveals its limitations in understanding and applying knowledge. The results provide valuable reference data for future AI development and a new perspective on the current state of the technology. As these shortcomings, mathematical reasoning above all, are addressed, there is reason to believe that AI will demonstrate stronger capabilities in more fields and better serve human society.