Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization

About

Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interfaces (GUIs). Nevertheless, in real-world applications, GUI agents often face non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former optimizes a Q-model that generates step-wise values, each evaluating the contribution of a given action to task completion. The latter takes step-wise samples from the state-action trajectory as inputs and optimizes the policy via reinforcement learning with our agentic-Q model. Notably, (i) all state-action trajectories are produced by the policy itself, so data collection costs remain manageable; and (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving strong performance on GUI navigation and grounding benchmarks and even surpassing larger-scale contenders.
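
To make the interplay of the two components more concrete, here is a minimal toy sketch in Python. It is not the authors' implementation: the abstract does not specify the training objective, so this sketch assumes a tabular softmax policy, a plain policy-gradient update with an expected-value baseline, and a hand-written `q_model` standing in for the learned agentic-Q model. All names (`q_model`, `rollout`, `stepwise_update`, `train`) and constants are hypothetical. The sketch only illustrates the two properties the abstract highlights: trajectories come from the policy itself, and the update step scores each (state, action) sample with the Q-model without calling the environment.

```python
import math
import random

N_ACTIONS = 3   # toy action space (assumption), e.g. click / type / scroll
HORIZON = 4     # number of steps (screens) per toy episode (assumption)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def q_model(state, action):
    """Hypothetical stand-in for the agentic-Q model.

    Returns a step-wise value for taking `action` at `state`. Here the
    "helpful" action is hard-coded; in the paper this would be a learned model.
    """
    return 1.0 if action == state % N_ACTIONS else 0.0


def rollout(policy_logits, rng):
    """The policy produces its own trajectory of (state, action) pairs.

    The toy environment simply advances one screen per step.
    """
    trajectory = []
    for state in range(HORIZON):
        probs = softmax(policy_logits[state])
        action = rng.choices(range(N_ACTIONS), weights=probs)[0]
        trajectory.append((state, action))
    return trajectory


def stepwise_update(policy_logits, samples, lr=0.5):
    """Update the policy one step-wise sample at a time.

    No environment call happens here: the only learning signal is q_model.
    """
    for state, action in samples:
        probs = softmax(policy_logits[state])
        # Baseline: expected Q-value under the current policy at this state.
        baseline = sum(p * q_model(state, a) for a, p in enumerate(probs))
        advantage = q_model(state, action) - baseline
        for a in range(N_ACTIONS):
            # Gradient of log pi(action | state) w.r.t. logit a.
            grad_log_prob = (1.0 if a == action else 0.0) - probs[a]
            policy_logits[state][a] += lr * advantage * grad_log_prob


def train(iterations=200, seed=0):
    rng = random.Random(seed)
    policy_logits = [[0.0] * N_ACTIONS for _ in range(HORIZON)]
    for _ in range(iterations):
        samples = rollout(policy_logits, rng)    # data from the policy itself
        stepwise_update(policy_logits, samples)  # environment-free update
    return policy_logits


if __name__ == "__main__":
    policy = train()
    for state, logits in enumerate(policy):
        print(state, [round(p, 3) for p in softmax(logits)])
```

In this toy setting, probability mass at each state shifts toward the action that `q_model` rates highest. In a real system the states would be screenshots plus task instructions, the policy and Q-model would both be MLLMs, and the update rule would follow whatever objective the report actually defines.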

Yibo Wang, Guangda Huzhang, Yuwei Hu, Yu Xia, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Lijun Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Result	
GUI Grounding	ScreenSpot v2	Avg Accuracy	92.45	371
GUI Grounding	ScreenSpot V1	Average Accuracy	88.68	39
GUI Navigation	Mind2Web Online (Average)	Success Rate	64	13
GUI Navigation	WebVoyager	Success Rate (Allrecipes)	88.89	12
GUI Navigation	Online-Mind2Web (Easy)	Success Rate	78.31	12
GUI Navigation	Online-Mind2Web (Medium)	Success Rate	65.73	9
GUI Navigation	Online-Mind2Web (Hard)	Success Rate	51.35	9

