OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

About

General-purpose computer-use agents have shown impressive performance across diverse digital environments. However, our new benchmark, OSExpert-Eval, indicates that they remain far less helpful than human experts. Although inference-time scaling enables adaptation, these agents complete complex tasks inefficiently and with degraded performance, transfer poorly to unseen UIs, and struggle with fine-grained action sequences. To address these limitations, we introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm that comprehensively explores and verifies an environment's unit functions. The agent then exploits compositionality between unit skills to self-construct a curriculum for composite tasks. To support fine-grained actions, we curate a database of action primitives for agents to discover during exploration; these are saved as a skill set once exploration is complete. We use the learned skills to improve the agent's performance and efficiency by (1) enriching agents with ready-to-use procedural knowledge, allowing them to plan only once for long trajectories and generate accurate actions, and (2) enabling them to end inference-time scaling earlier by recognizing the boundaries of their capabilities. Extensive experiments show that our environment-learned agent takes a meaningful step toward expert-level computer use, achieving a performance gain of around 20 percent on OSExpert-Eval and closing the efficiency gap to humans by around 80 percent.
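The exploration idea in the abstract can be illustrated with a small sketch. This is not the authors' GUI-DFS implementation. It assumes a toy GUI modeled as a dictionary of states and actions, and every name in it (TOY_GUI, verify, gui_dfs, build_curriculum) is hypothetical. It shows only the general shape: a depth-first traversal that records verified action paths as unit skills, followed by pairing those skills into composite tasks.

```python
from itertools import permutations

# Hypothetical toy GUI: each state maps an action label to the state it opens.
TOY_GUI = {
    "main": {"File": "file_menu", "Filters": "filters_menu"},
    "file_menu": {"Export": "export_dialog"},
    "filters_menu": {"Blur": "blur_dialog"},
    "export_dialog": {},
    "blur_dialog": {},
}


def verify(path):
    """Stand-in verifier (assumption): a real agent would check the UI outcome."""
    return len(path) > 0


def gui_dfs(gui, state, path=(), visited=None, skills=None):
    """Depth-first traversal that records each verified leaf path as a unit skill."""
    visited = set() if visited is None else visited
    skills = [] if skills is None else skills
    visited.add(state)
    actions = gui[state]
    if not actions and verify(path):
        skills.append(path)  # no further actions: treat the path as one unit function
    for action, next_state in actions.items():
        if next_state not in visited:
            gui_dfs(gui, next_state, path + (action,), visited, skills)
    return skills


def build_curriculum(skills, size=2):
    """Compose unit skills into ordered composite tasks of the given size."""
    return list(permutations(skills, size))


unit_skills = gui_dfs(TOY_GUI, "main")
curriculum = build_curriculum(unit_skills)
print(unit_skills)
print(curriculum)
```

In this toy setting the traversal finds two unit skills, one per dialog, and the curriculum contains both orderings of the pair. A real environment would need screenshot- or accessibility-tree-based state detection and a genuine verification step, neither of which is modeled here.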

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, Heng Ji • 2026

Related benchmarks

Task	Dataset	Result	
GUI Agent Task Completion	OSWorld 1.0 (test)	--	42
Fine-Grained Action Execution	OSExpert-Eval	GIMP Execution Time (s): 35	10
Long-Horizon Composite Skills	OSExpert-Eval	Execution Time (GIMP): 32	10
Unseen UI Generalization	OSExpert-Eval	Execution Time (Tableau, s): 28	10
Fine-Grained Action Execution	OSExpert-Eval	GIMP Success Rate: 28	8
Long-Horizon Composite Skills	OSExpert-Eval	GIMP Success Rate: 33	8
Unseen UI Generalization	OSExpert-Eval	Tableau Success Rate: 25	8
