Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

About

Graphical user interface (GUI) grounding, the ability to map natural language instructions to specific actions on graphical user interfaces, remains a critical bottleneck in computer use agent development. Current benchmarks oversimplify grounding tasks as short referring expressions, failing to capture the complexity of real-world interactions that require software commonsense, layout understanding, and fine-grained manipulation capabilities. To address these limitations, we introduce OSWorld-G, a comprehensive benchmark comprising 564 finely annotated samples across diverse task types including text matching, element recognition, layout understanding, and precise manipulation. Additionally, we synthesize and release the largest computer use grounding dataset Jedi, which contains 4 million examples through multi-perspective decoupling of tasks. Our multi-scale models trained on Jedi demonstrate its effectiveness by outperforming existing approaches on ScreenSpot-v2, ScreenSpot-Pro, and our OSWorld-G. Furthermore, we demonstrate that improved grounding with Jedi directly enhances agentic capabilities of general foundation models on complex computer tasks, improving from 5% to 27% on OSWorld. Through detailed ablation studies, we identify key factors contributing to grounding performance and verify that combining specialized data for different interface elements enables compositional generalization to novel interfaces. All benchmark, data, checkpoints, and code are open-sourced and available at https://osworld-grounding.github.io.

Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, Caiming Xiong • 2025

Related benchmarks

Task	Dataset	Result
GUI Grounding	ScreenSpot Pro	Average Score 3.95e+3	458
GUI Grounding	ScreenSpot v2	Avg Accuracy 91.7	371
GUI Grounding	ScreenSpot Pro	Accuracy 50.2	195
GUI Grounding	OSWorld-G	Average Score 54.1	144
Grounding	ScreenSpot Pro	Average Grounding Accuracy 39.5	82
GUI Grounding	UI-Vision	Average Score 24.8	68
GUI Grounding	MMBench-GUI L2 (test)	Average Error 70.4	67
GUI Grounding	ScreenSpot Desktop V2	Text Accuracy 96.9	60
GUI Grounding	ScreenSpot Web V2	Text Accuracy 94.4	60
GUI Grounding	ScreenSpot Mobile V2	Text Accuracy 96.9	60

Showing 10 of 24 rows
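
As a rough illustration of how grounding accuracy figures like those above are commonly computed, the sketch below counts a prediction as correct when the predicted click point falls inside the target element's bounding box. This is a minimal sketch under that assumption, not the paper's evaluation code: the `Box` class, the `grounding_accuracy` function, and the sample coordinates are all hypothetical, and individual benchmarks may use other target shapes or scoring rules.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned target region in pixel coordinates (hypothetical layout)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        # A point on the border counts as inside.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def grounding_accuracy(predictions, targets):
    """Percentage of predicted click points that land inside their target box.

    Assumed scoring rule (point-in-box); not taken from the paper.
    """
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    if not targets:
        return 0.0
    hits = sum(box.contains(x, y) for (x, y), box in zip(predictions, targets))
    return 100.0 * hits / len(targets)


# Made-up example: the first click lands inside its box, the second misses.
targets = [Box(10, 10, 110, 40), Box(200, 300, 260, 330)]
predictions = [(55, 25), (150, 310)]
print(grounding_accuracy(predictions, targets))
```

With these made-up inputs, one of the two clicks is a hit, so the function reports half of the samples as correct.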
