UI-Copilot: Advancing Long-Horizon GUI Automation via Tool-Integrated Policy Optimization

About

MLLM-based GUI agents have demonstrated strong capabilities in complex user interface interaction tasks. However, long-horizon scenarios remain challenging, as these agents are burdened with tasks beyond their intrinsic capabilities and suffer from memory degradation, progress confusion, and math hallucination. To address these challenges, we present UI-Copilot, a collaborative framework in which the GUI agent focuses on task execution while a lightweight copilot provides on-demand assistance for memory retrieval and numerical computation. We introduce memory decoupling to separate persistent observations from transient execution context, and train the policy agent to selectively invoke the copilot as Retriever or Calculator based on task demands. To enable effective learning of tool invocation, we propose Tool-Integrated Policy Optimization (TIPO), which separately optimizes tool selection through single-turn prediction and task execution through on-policy multi-turn rollouts. Experimental results show that UI-Copilot-7B achieves state-of-the-art performance on the challenging MemGUI-Bench, outperforming strong 7B-scale GUI agents such as GUI-Owl-7B and UI-TARS-1.5-7B. Moreover, UI-Copilot-7B delivers a 17.1% absolute improvement on AndroidWorld over the base Qwen model, highlighting UI-Copilot's strong generalization to real-world GUI tasks.
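The abstract describes two mechanisms that a small sketch may make more concrete: memory decoupling (persistent observations kept apart from transient execution context) and on-demand dispatch to a copilot acting as Retriever or Calculator. The Python below is only an illustration under stated assumptions. Every name in it (`DecoupledMemory`, `copilot`, `retriever`, `calculator`, the `window` parameter) is hypothetical, the keyword-overlap retrieval and arithmetic evaluator are stand-ins for whatever the lightweight copilot model actually does, and none of it is taken from the authors' implementation.

```python
import ast
import operator
from dataclasses import dataclass, field


@dataclass
class DecoupledMemory:
    """Hypothetical store that keeps two kinds of memory apart."""
    persistent: list[str] = field(default_factory=list)  # observations kept for the whole task
    transient: list[str] = field(default_factory=list)   # only the most recent actions
    window: int = 3                                       # assumed size of the transient context

    def observe(self, note: str) -> None:
        # Persistent observations are never truncated.
        self.persistent.append(note)

    def step(self, action: str) -> None:
        # Transient context is trimmed to the last `window` actions.
        self.transient.append(action)
        self.transient = self.transient[-self.window:]


_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expr: str) -> float:
    """Evaluate +, -, *, / over numeric literals without using eval()."""
    def ev(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


def retriever(memory: DecoupledMemory, query: str) -> list[str]:
    """Return persistent notes sharing at least one word with the query."""
    words = set(query.lower().split())
    return [n for n in memory.persistent if words & set(n.lower().split())]


def copilot(tool: str, arg: str, memory: DecoupledMemory):
    """Dispatch a tool call that the policy agent chose to emit."""
    if tool == "Retriever":
        return retriever(memory, arg)
    if tool == "Calculator":
        return calculator(arg)
    raise ValueError(f"unknown tool: {tool}")
```

In this sketch the agent would call `copilot("Retriever", ...)` when it needs something seen many steps ago and `copilot("Calculator", ...)` when it needs arithmetic, rather than relying on its own context or mental math. How the policy learns when to make those calls is the role the abstract assigns to TIPO, which is not modeled here.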

Zhengxi Lu, Fei Tang, Guangyi Liu, Kaitao Song, Xu Tan, Jin Ma, Wenqi Zhang, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen • 2026

Related benchmarks

Task	Dataset	Result
GUI Grounding	ScreenSpot v2	Avg Accuracy: 90	371
GUI Grounding	ScreenSpot Pro	--	195
Mobile GUI Automation	AndroidWorld	Overall Success Rate: 39.1	68
GUI Interaction Control	GUI-Odyssey	SR: 57.2	31
GUI Automation	AndroidControl High	Task Match (TM): 82.9	27
GUI Automation	MiniWob++	Success Rate: 61.2	25
GUI reasoning	AndroidControl Low	SR: 89.2	24
GUI Understanding	AndroidControl High	Task Match Rate (TM): 82.9	22
GUI Automation	GUI-Odyssey	Task Metric (TM): 74.5	15
Long-horizon GUI Interaction	MemGUI-Bench	Precision@1 (1 App): 42.9	14

Showing 10 of 12 rows
