In professional environments, graphical user interface (GUI) agents face three key challenges. First, professional applications are far more complex than general-purpose software: they pack many functional modules and intricate interaction logic into dense layouts that an agent must understand in depth. Second, professional tools usually run at higher resolutions, so targets occupy a smaller share of the screen and are harder to locate accurately, especially when the interface elements are tiny. Finally, workflows tend to rely on additional tools and documentation, which makes operations more complex. These challenges highlight the need for more advanced benchmarks and solutions to improve the performance of GUI agents in such demanding scenarios.

Current GUI positioning (grounding) models and benchmarks cannot meet the requirements of professional environments. For example, benchmarks such as ScreenSpot are designed mainly for low-resolution tasks and lack the diversity needed to simulate real-world scenarios accurately. Models such as OS-Atlas and UGround are computationally inefficient and often fail when the target is small or the interface is dense with icons. In addition, the lack of multilingual support limits the use of these models in global workflows. These shortcomings underline the need for a more comprehensive and realistic benchmark to advance this area.
To address these issues, researchers from the National University of Singapore, East China Normal University and Hong Kong Baptist University have released ScreenSpot-Pro, a new benchmark tailored for high-resolution professional environments. The benchmark contains 1,581 tasks across 23 professional applications, covering development, creative tools, CAD, scientific platforms and office suites. It uses high-resolution full-screen screenshots and relies on expert annotations to ensure accuracy and realism. ScreenSpot-Pro also provides instructions in both English and Chinese to broaden the scope of evaluation. Unlike earlier benchmarks, ScreenSpot-Pro records actual workflows, which yields high-quality annotations and makes it an effective tool for the comprehensive evaluation and development of GUI positioning models.
The dataset captures realistic and challenging scenes in high-resolution images, where the target area accounts for only 0.07% of the total screen on average, which shows how small and fine-grained the GUI elements are. The data was collected by professional users with extensive experience in the relevant applications, using specialized tools to ensure accurate annotations. The dataset also includes bilingual instructions for testing multilingual ability and covers multiple workflows to capture the nuances of professional tasks. These features make it particularly useful for evaluating and improving the accuracy and flexibility of GUI agents.
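
To put the 0.07% figure in perspective, the following minimal sketch computes how much of the screen a target covers and how large a square target of that share would be. The `(x1, y1, x2, y2)` box layout and the 3840x2160 screen are assumptions chosen for illustration; they are not taken from the dataset's schema.

```python
import math


def target_area_fraction(bbox, screen_size):
    """Return the share of the screen covered by a bounding box.

    bbox is (x1, y1, x2, y2) in pixels and screen_size is (width, height).
    Both layouts are illustrative assumptions, not the dataset schema.
    """
    x1, y1, x2, y2 = bbox
    width, height = screen_size
    return ((x2 - x1) * (y2 - y1)) / (width * height)


def square_side_for_fraction(fraction, screen_size):
    """Side length in pixels of a square covering `fraction` of the screen."""
    width, height = screen_size
    return math.sqrt(fraction * width * height)


# Hypothetical 4K screenshot; 0.07% is the average share reported in the text.
screen = (3840, 2160)
side = square_side_for_fraction(0.0007, screen)
print(f"A 0.07% target on a {screen[0]}x{screen[1]} screen is about {side:.0f} px square")
```

Under these assumptions, an average target is roughly a 76-pixel square on a screen of more than eight million pixels, which gives a sense of how little room for error a predicted click has.
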
An analysis of existing GUI positioning models on ScreenSpot-Pro shows that they fall well short in high-resolution professional environments. The best-performing model, OS-Atlas-7B, reaches an accuracy of only 18.9%. ReGround, which takes an iterative approach, improves on this by refining its predictions over multiple steps and achieves an accuracy of 40.2%. Small components such as icons remain especially hard to identify, and the bilingual tasks further expose the models' limitations. These findings point to the need for better techniques that strengthen contextual understanding and adaptability in complex GUI environments.
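
The accuracy figures and the multi-step idea can be illustrated with a short sketch. It assumes a prediction counts as correct when the predicted click point falls inside the annotated bounding box, and it mimics iterative refinement by cropping around the first prediction and querying again. This is not ReGround's actual implementation: `ground_fn`, `crop_around`, `refine_prediction` and the toy model are hypothetical names introduced only for this example.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels (assumed layout)


def point_in_box(point: Point, box: Box) -> bool:
    """True if the predicted click point lies inside the target box."""
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def grounding_accuracy(preds: Sequence[Point], boxes: Sequence[Box]) -> float:
    """Fraction of predictions that land inside their target boxes."""
    if len(preds) != len(boxes):
        raise ValueError("preds and boxes must have the same length")
    if not preds:
        return 0.0
    return sum(point_in_box(p, b) for p, b in zip(preds, boxes)) / len(preds)


def crop_around(point: Point, crop_size: Tuple[float, float],
                screen_size: Tuple[float, float]) -> Box:
    """Window of crop_size centered on point, clamped to the screen."""
    cw, ch = crop_size
    sw, sh = screen_size
    x1 = min(max(point[0] - cw / 2, 0), max(sw - cw, 0))
    y1 = min(max(point[1] - ch / 2, 0), max(sh - ch, 0))
    return (x1, y1, min(x1 + cw, sw), min(y1 + ch, sh))


def refine_prediction(ground_fn: Callable[[Box], Point],
                      screen_size: Tuple[float, float],
                      crop_size: Tuple[float, float],
                      steps: int = 2) -> Point:
    """Query on the full screen, then re-query on crops around each prediction.

    ground_fn(region) is a hypothetical model call that returns a click
    point in full-screen coordinates for the given region.
    """
    point = ground_fn((0, 0, screen_size[0], screen_size[1]))
    for _ in range(steps - 1):
        point = ground_fn(crop_around(point, crop_size, screen_size))
    return point


# Toy stand-in for a model: its error shrinks with the width of the region.
TARGET = (3000.0, 500.0)


def toy_ground_fn(region: Box) -> Point:
    err = 0.01 * (region[2] - region[0])
    return (TARGET[0] + err, TARGET[1] + err)


screen = (3840, 2160)
target_box = (2980, 480, 3020, 520)
first = toy_ground_fn((0, 0, *screen))
refined = refine_prediction(toy_ground_fn, screen, crop_size=(1280, 720))
print("single pass hit:", point_in_box(first, target_box))
print("refined hit:", point_in_box(refined, target_box))
```

In this toy setup the single full-screen pass misses the small target while the second pass on the crop lands inside it. The toy model only illustrates why narrowing the search region can help with tiny targets; it says nothing about how any real model behaves.
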
ScreenSpot-Pro sets a new bar for evaluating GUI agents in high-resolution professional environments. It addresses the specific challenges of complex workflows and provides diverse, precisely annotated data to guide progress in GUI positioning. This contribution lays the foundation for smarter and more efficient agents that can carry out professional tasks smoothly and improve productivity across industries.
Paper: https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf
Data: https://huggingface.co/datasets/likaixin/ScreenSpot-Pro
Key points:
**Complexity of professional applications**: GUI agents need to handle professional software interfaces with high complexity and high resolution.
**ScreenSpot-Pro dataset**: Contains 1,581 tasks, covers 23 professional applications, and supports multilingual evaluation.
**Model performance improvement**: Iterative, multi-step refinement improves the accuracy of GUI positioning models in high-resolution environments.