Building Autonomous GUI Navigation via Agentic-Q Estimation and Step-Wise Policy Optimization
About
Recent advances in Multimodal Large Language Models (MLLMs) have substantially driven the progress of autonomous agents for Graphical User Interfaces (GUIs). Nevertheless, in real-world applications, GUI agents often face non-stationary environments, leading to high computational costs for data curation and policy optimization. In this report, we introduce a novel MLLM-centered framework for GUI agents, which consists of two components: agentic-Q estimation and step-wise policy optimization. The former optimizes a Q-model that generates step-wise values to evaluate the contribution of a given action to task completion. The latter takes step-wise samples from the state-action trajectory as inputs and optimizes the policy via reinforcement learning with our agentic-Q model. Notably, (i) all state-action trajectories are produced by the policy itself, so data collection costs remain manageable; and (ii) the policy update is decoupled from the environment, ensuring stable and efficient optimization. Empirical evaluations show that our framework endows Ovis2.5-9B with powerful GUI interaction capabilities, achieving remarkable performance on GUI navigation and grounding benchmarks and even surpassing larger-scale contenders.
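To make the two components more concrete, the following is a minimal sketch of how step-wise samples might be scored by a Q-model without touching the environment. It is not the authors' implementation: the abstract does not specify the training objective, so the group-normalized advantage used here is an assumption, and `Step`, `flatten`, `group_advantages`, `score_step`, `toy_propose`, and `toy_q` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Sequence, Tuple


@dataclass
class Step:
    state: str   # stand-in for a screenshot plus the task instruction
    action: str  # stand-in for a GUI action, e.g. "click(search)"


def flatten(trajectories: Sequence[Sequence[Step]]) -> List[Step]:
    """Turn whole trajectories into independent step-wise samples."""
    return [step for traj in trajectories for step in traj]


def group_advantages(q_values: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Center and scale Q-values within one group of candidate actions.

    Assumption: group normalization is one common choice; the source
    does not state which advantage estimate is used.
    """
    mu = mean(q_values)
    sigma = pstdev(q_values)
    return [(q - mu) / (sigma + eps) for q in q_values]


def score_step(
    step: Step,
    propose: Callable[[str], List[str]],
    q_model: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Score candidate actions for one recorded state using only the Q-model.

    No environment call happens here, which is what "decoupled from the
    environment" means in this sketch.
    """
    candidates = propose(step.state)
    q_values = [q_model(step.state, action) for action in candidates]
    return list(zip(candidates, group_advantages(q_values)))


# Toy stand-ins for the policy's sampler and the learned Q-model (hypothetical).
def toy_propose(state: str) -> List[str]:
    return ["click(search)", "scroll(down)", "type('pasta')"]


def toy_q(state: str, action: str) -> float:
    return 1.0 if action.startswith("click") else 0.0


if __name__ == "__main__":
    trajectories = [[Step("home page", "click(search)"), Step("results", "scroll(down)")]]
    for step in flatten(trajectories):
        print(step.state, score_step(step, toy_propose, toy_q))
```

In a real system the scored candidates would feed a policy-gradient update on the MLLM; that part is omitted because the source gives no details about it.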
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Grounding | ScreenSpot v2 | Avg Accuracy | 92.45 | 203 |
| GUI Grounding | ScreenSpot V1 | Mobile Text Accuracy | 95.24 | 15 |
| GUI Navigation | WebVoyager | Success Rate (Allrecipes) | 88.89 | 12 |
| GUI Navigation | Mind2Web Online (Average) | Success Rate | 64 | 10 |
| GUI Navigation | Online-Mind2Web (Easy) | Success Rate | 78.31 | 9 |
| GUI Navigation | Online-Mind2Web (Medium) | Success Rate | 65.73 | 9 |
| GUI Navigation | Online-Mind2Web (Hard) | Success Rate | 51.35 | 9 |