OSExpert: Computer-Use Agents Learning Professional Skills via Exploration

About

General-purpose computer-use agents have shown impressive performance across diverse digital environments. However, our new benchmark, OSExpert-Eval, indicates they remain far less helpful than human experts. Although inference-time scaling enables adaptation, these agents complete complex tasks inefficiently and with degraded performance, transfer poorly to unseen UIs, and struggle with fine-grained action sequences. To address these limitations, we introduce a GUI-based depth-first search (GUI-DFS) exploration algorithm that comprehensively explores and verifies an environment's unit functions. The agent then exploits compositionality between unit skills to self-construct a curriculum for composite tasks. To support fine-grained actions, we curate a database of action primitives for agents to discover during exploration; these are saved as a skill set once exploration is complete. We use the learned skills to improve the agent's performance and efficiency by (1) enriching agents with ready-to-use procedural knowledge, allowing them to plan only once for long trajectories and generate accurate actions, and (2) enabling them to end inference-time scaling earlier by recognizing the boundary of their capabilities. Extensive experiments show that our environment-learned agent takes a meaningful step toward expert-level computer use, achieving a performance gain of around 20 percent on OSExpert-Eval and closing the efficiency gap to humans by around 80 percent.
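The exploration idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no algorithmic details, so everything below is an assumption made for illustration, including the toy UI graph `TOY_UI`, the function names `gui_dfs` and `compose`, the depth limit, and the stand-in verification rule (an action counts as verified if it changes the UI state).

```python
from typing import Callable, Dict, Hashable, List, Set

State = Hashable
Action = str

def gui_dfs(
    root: State,
    list_actions: Callable[[State], List[Action]],
    apply_action: Callable[[State, Action], State],
    verify: Callable[[State, Action, State], bool],
    max_depth: int = 3,
) -> Dict[Action, List[Action]]:
    """Explore UI states depth-first and record one action path per verified unit skill."""
    skills: Dict[Action, List[Action]] = {}
    visited: Set[State] = {root}

    def visit(state: State, path: List[Action], depth: int) -> None:
        if depth == max_depth:
            return
        for action in list_actions(state):
            next_state = apply_action(state, action)
            new_path = path + [action]
            # Keep only actions whose effect passes verification.
            if verify(state, action, next_state):
                skills.setdefault(action, new_path)
            # Recurse into states that have not been explored yet.
            if next_state not in visited:
                visited.add(next_state)
                visit(next_state, new_path, depth + 1)

    visit(root, [], 0)
    return skills

def compose(skills: Dict[Action, List[Action]], names: List[Action]) -> List[Action]:
    """Chain unit skills into a composite task (assumes each skill ends back at the root)."""
    plan: List[Action] = []
    for name in names:
        plan.extend(skills[name])
    return plan

# Hypothetical toy UI: state -> {action: resulting state}.
TOY_UI = {
    "main": {"open_filters": "filters_menu", "open_colors": "colors_menu"},
    "filters_menu": {"apply_blur": "main"},
    "colors_menu": {"invert": "main", "broken_button": "colors_menu"},
}

skills = gui_dfs(
    root="main",
    list_actions=lambda s: sorted(TOY_UI.get(s, {})),
    apply_action=lambda s, a: TOY_UI[s][a],
    verify=lambda s, a, n: n != s,  # stand-in check: the action changed the UI state
)
composite = compose(skills, ["apply_blur", "invert"])
```

In this toy setting, `broken_button` is never stored because it fails the stand-in verification, and `composite` is simply the concatenation of two stored action paths. A real agent would need a much richer notion of state, verification, and skill composition than this sketch assumes.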

Jiateng Liu, Zhenhailong Wang, Rushi Wang, Bingxuan Li, Jeonghwan Kim, Aditi Tiwari, Pengfei Yu, Denghui Zhang, Heng Ji • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| GUI Agent Task Completion | OSWorld 1.0 (test) | Success Rate (GIMP) | 30.8 | 20 |
| Fine-Grained Action Execution | OSExpert-Eval | GIMP Execution Time (s) | 35 | 10 |
| Long-Horizon Composite Skills | OSExpert-Eval | Execution Time (GIMP) | 32 | 10 |
| Unseen UI Generalization | OSExpert-Eval | Execution Time (Tableau, s) | 28 | 10 |
| Fine-Grained Action Execution | OSExpert-Eval | GIMP Success Rate | 28 | 8 |
| Long-Horizon Composite Skills | OSExpert-Eval | GIMP Success Rate | 33 | 8 |
| Unseen UI Generalization | OSExpert-Eval | Tableau Success Rate | 25 | 8 |