OpenAI's latest benchmark: the best AI model solves about a quarter of real-world coding tasks, exposing its limits

Author: Eve Cole | Updated: 2025-05-27 12:25:02

OpenAI recently released an assessment of AI programming capabilities that measures how AI performs on real, paid software development work worth a total of $1 million. The benchmark, called SWE-Lancer, covers more than 1,400 real freelance tasks from Upwork and assesses AI performance in two roles: direct development and project management decisions. The test demonstrates the potential of AI in programming tasks and provides a reference point for future technical work.

Test results show that the best-performing model, Claude 3.5 Sonnet, had a success rate of 26.2% on coding tasks and 44.9% on project management decisions. Although these results are still far from what human developers achieve, they carry real economic weight. On the public Diamond dataset, the model completed tasks worth $208,050. Extended to the full dataset, AI is expected to handle tasks worth more than $400,000, which suggests enterprises could cut a meaningful share of their software development costs.
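The two kinds of figures above, a success rate and a dollar amount, measure different things: one counts tasks, the other weights each task by what it paid. The short Python sketch below illustrates the difference. It is only an illustration: the `TaskResult` type, the `summarize` function, and the four sample tasks are made up for this example and are not taken from the SWE-Lancer dataset or its tooling.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    payout_usd: float  # price the task was listed for (hypothetical values below)
    passed: bool       # True if the model's solution was accepted


def summarize(results):
    """Return the task-count pass rate and the payout-weighted earnings."""
    total = len(results)
    solved = sum(r.passed for r in results)
    earned = sum(r.payout_usd for r in results if r.passed)
    available = sum(r.payout_usd for r in results)
    return {
        "pass_rate": solved / total,          # share of tasks solved
        "earned_usd": earned,                 # dollars for solved tasks only
        "earned_share": earned / available,   # share of all available dollars
    }


# Made-up results: the model solves the two cheap tasks and fails the two expensive ones.
sample = [
    TaskResult(250.0, True),
    TaskResult(1000.0, False),
    TaskResult(500.0, True),
    TaskResult(8000.0, False),
]

summary = summarize(sample)
print(summary)
```

In this made-up sample the model solves half of the tasks but earns well under a tenth of the available money, because the tasks it fails are the expensive ones. A model that handles simple fixes but not complex features can show exactly this kind of gap between the two metrics.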

However, the research also reveals clear limitations of AI in complex development tasks. AI can handle simple bug fixes, such as removing redundant API calls, but it performs poorly on complex projects that require in-depth understanding and a comprehensive solution, such as building cross-platform video playback. Notably, AI can often locate the problematic code but struggles to understand the root cause and deliver a complete fix. This shows that applying AI to software development still requires further technical breakthroughs.

To promote research in this field, OpenAI has open-sourced the SWE-Lancer Diamond dataset and related tools on GitHub, allowing researchers to evaluate different programming models against a unified standard. The release gives researchers a reference for improving AI programming capabilities and gives the global developer community a shared resource to build on.