As artificial intelligence technology advances rapidly, language models are being used in an ever-wider range of fields. A new OpenAI study, however, found that even the most advanced of these models fall far short of expectations when answering factual questions, prompting a rethink of how well AI actually acquires knowledge.
The study used OpenAI's own SimpleQA benchmark, which contains 4,326 questions spanning fields such as science, politics, and art; each question has a single, clearly correct answer.
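For a sense of how such a benchmark is run, here is a minimal sketch of an evaluation loop over SimpleQA-style question/answer pairs. The CSV file name, its column names, the `ask_model` stub, and the exact-match grading are illustrative assumptions; the actual benchmark grades answers with a separate judge model rather than string matching.

```python
# Minimal sketch of a SimpleQA-style evaluation loop (assumptions noted above).
import csv

def ask_model(question: str) -> str:
    """Placeholder for a call to the language model under test."""
    raise NotImplementedError

def evaluate(path: str) -> float:
    correct = total = 0
    with open(path, newline="", encoding="utf-8") as f:
        # Assumes one question per row with 'problem' and 'answer' columns.
        for row in csv.DictReader(f):
            prediction = ask_model(row["problem"])
            # Naive exact-match grading, for illustration only.
            correct += prediction.strip().lower() == row["answer"].strip().lower()
            total += 1
    return correct / total if total else 0.0
```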

After verification by two independent reviewers, the results showed that OpenAI's best model, o1-preview, achieved an accuracy of only 42.7%, while GPT-4o scored slightly lower at 38.2%. The smaller GPT-4o-mini managed just 8.6%. Anthropic's Claude models fared worse still, with Claude-3.5-Sonnet reaching only 28.9%.

The significance of the study lies in its test design: it not only measures AI performance but also makes plain the limits of these models in acquiring knowledge. The researchers stress that users should treat these models as information-processing tools rather than as authoritative knowledge sources. For more accurate answers, it is best to supply the AI with reliable source material rather than relying solely on its built-in knowledge.
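As a concrete illustration of that advice, the sketch below grounds a factual question in supplied source text using the OpenAI Python SDK. The model name and prompt wording are illustrative choices, not part of the study.

```python
# Minimal sketch: answer from supplied source text instead of relying
# on the model's built-in knowledge. Model name and prompts are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def grounded_answer(question: str, source_text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Answer using ONLY the provided source. "
                        "If the source does not contain the answer, say so."},
            {"role": "user",
             "content": f"Source:\n{source_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```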

It is worth noting that these models routinely overestimate their own abilities. The researchers found that when the models were asked to rate their confidence in their responses, they usually gave inflated accuracy scores. In tests that posed the same question repeatedly, even when a model returned the same answer every time, its actual success rate remained below its self-assessed accuracy. This matches a common outside criticism: language models often produce absurd answers while sounding entirely confident.
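To make the comparison between stated confidence and measured accuracy concrete, here is a minimal calibration check. The 10%-wide confidence bins and the input data format are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal calibration check: group graded answers by the model's stated
# confidence and compare per-bin accuracy against the confidence level.
from collections import defaultdict

def calibration_table(results: list[tuple[float, bool]]) -> dict[int, float]:
    """results: (stated confidence in [0, 1], answer_was_correct) pairs."""
    bins = defaultdict(list)
    for confidence, correct in results:
        bins[min(int(confidence * 10), 9)].append(correct)
    # A well-calibrated model's per-bin accuracy tracks the bin's confidence
    # level; an overconfident model scores below it, as the study reports.
    return {b: sum(v) / len(v) for b, v in sorted(bins.items())}
```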
The researchers conclude that current AI systems have a clear gap in factual accuracy that urgently needs closing. They also leave an open question: does a model's performance on short factual questions predict how it will handle longer, more complex responses? To support the development of more reliable language models, OpenAI has published the SimpleQA benchmark data publicly on GitHub.
Key points:
OpenAI's research shows that even the most advanced language models have low success rates on factual questions, topping out at just 42.7%.
These models routinely overestimate their abilities, giving generally inflated confidence scores.
OpenAI has made the SimpleQA benchmark public to support research into more reliable language models.
The study is a reminder that despite significant progress in AI technology, its limitations must be treated with caution in practical applications, and technological improvement must continue.