Technical evaluation report: Claude 3.5 Sonnet reaches the level of a specialist PhD

Author: Eve Cole | Updated: 2025-02-24 23:00:02

Claude 3.5 Sonnet, the large language model developed by Anthropic, has performed extremely well in recent tests. It scored 67.2% on the graduate-level scientific question answering (GPQA) test, breaking the 65% mark for the first time and edging past the average score of PhD holders answering in their own specialty. The result marks a new milestone for large language models in understanding and answering advanced scientific questions, and it widens the range of fields in which artificial intelligence could be applied. The rest of this article looks at Claude 3.5 Sonnet's results and what they mean in more detail.

Anthropic's latest model, Claude 3.5 Sonnet, delivered impressive results in recent technical evaluations, exceeding even the level of specialist PhD holders. On the GPQA test, Claude 3.5 Sonnet achieved a score of 67.2%. This is the first time a large language model has exceeded 65% on this evaluation, and it shows that the model's ability to understand and answer advanced scientific questions has reached a new high.

GPQA is a benchmark that measures how well language models answer graduate-level scientific questions. Its questions are complex and highly specialized, and they place heavy demands on a model's reasoning and its ability to integrate knowledge. On this test, doctorate holders in general average about 34%, while doctorate holders answering within their own specialty average 65%. By some estimates, a language model that scores 60% on GPQA is roughly comparable to a person with an IQ of 150, although this is a loose comparison rather than a measured result.
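To make the comparison concrete, the arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the two baseline values are the ones quoted in this article, the function names are hypothetical, and the scoring rule (percentage of questions answered correctly) is an assumption about how a question-answering benchmark of this kind is typically graded, not a description of the official evaluation code.

```python
# Reference points quoted in this article (percent of questions answered correctly).
GENERAL_PHD_BASELINE = 34.0      # doctorate holders in general
SPECIALIST_PHD_BASELINE = 65.0   # doctorate holders within their own specialty


def accuracy_percent(predictions, answers):
    """Return the percentage of predictions that match the reference answers."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


def margins(score):
    """Return how many percentage points a score sits above each baseline."""
    return {
        "vs_general_phd": round(score - GENERAL_PHD_BASELINE, 1),
        "vs_specialist_phd": round(score - SPECIALIST_PHD_BASELINE, 1),
    }


if __name__ == "__main__":
    # Toy example with made-up answers: three of four correct.
    print(accuracy_percent(["A", "B", "C", "D"], ["A", "B", "C", "A"]))
    # Margins for the 67.2% score reported in this article.
    print(margins(67.2))
```

With the reported score of 67.2%, the margin over the specialist baseline is 2.2 percentage points, while the margin over the general baseline is 33.2 points.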

Although there is currently no specific GPQA data for GPT-4o and GPT-4T, the available information suggests that Claude 3.5 Sonnet performs better than both models. In other related evaluations, such as the 0-shot CoT evaluation, Claude 3.5 Sonnet also scored higher than GPT-4o (53.6%) and GPT-4T (48.0%), which further supports its leading position in language understanding and question answering.

Anthropic's result not only demonstrates the capabilities of Claude 3.5 Sonnet but also sets a new benchmark for large language models on advanced question-answering tasks. As the technology continues to advance, the range of fields in which these models can be applied is likely to keep growing.

Claude 3.5 Sonnet's breakthrough performance suggests that large language models will play an increasingly important role in scientific research and knowledge acquisition. We look forward to more breakthroughs of this kind that advance artificial intelligence technology and benefit society.