WorldGUI: An Interactive Benchmark for Desktop GUI Automation from Any Starting Point
About
Recent progress in GUI agents has substantially improved visual grounding, yet robust planning remains challenging, particularly when the environment deviates from a canonical initial state. In real applications, users often invoke assistance mid-workflow, where software may be partially configured, steps may have been executed in different orders, or the interface may differ from its default setup. Such task-state variability is pervasive but insufficiently evaluated in existing GUI benchmarks. To address this gap, we introduce WorldGUI, a benchmark covering ten widely used desktop and web applications with tasks instantiated under diverse, systematically constructed initial states. These variations capture realistic human-computer interaction settings and enable diagnostic evaluation of an agent's ability to recover, adapt plans, and handle non-default contexts. We further present WorldGUI-Agent, a simple and model-agnostic framework that organizes planning and execution around three critique stages, improving reliability in dynamic environments. Experiments demonstrate that state-of-the-art GUI agents exhibit substantial performance degradation under non-default initial conditions, revealing limited robustness and fragile planning behaviors. Our benchmark and framework provide a foundation for developing more adaptable and reliable GUI agents. The code and data are available at https://github.com/showlab/WorldGUI.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Automation | WorldGUI Meta 1.0 | Success Rate (Office) | 64.4 | 11 |
| GUI Automation | WorldGUI Augmented 1.0 | Success Rate (Office) | 37.6 | 11 |
| GUI Automation | WindowsAgentArena | Success Rate (Office) | 7 | 11 |