OS-Symphony: A Holistic Framework for Robust and Generalist Computer-Using Agent
About
While Vision-Language Models (VLMs) have significantly advanced Computer-Using Agents (CUAs), current frameworks struggle with robustness in long-horizon workflows and generalization in novel domains. These limitations stem from a lack of granular control over historical visual context curation and the absence of visual-aware tutorial retrieval. To bridge these gaps, we introduce OS-Symphony, a holistic framework that comprises an Orchestrator coordinating two key innovations for robust automation: (1) a Reflection-Memory Agent that utilizes milestone-driven long-term memory to enable trajectory-level self-correction, effectively mitigating visual context loss in long-horizon tasks; (2) Versatile Tool Agents featuring a Multimodal Searcher that adopts a SeeAct paradigm to navigate a browser-based sandbox to synthesize live, visually aligned tutorials, thereby resolving fidelity issues in unseen scenarios. Experimental results demonstrate that OS-Symphony delivers substantial performance gains across varying model scales, establishing new state-of-the-art results on three online benchmarks, notably achieving 65.84% on OSWorld.
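To make the idea of milestone-driven long-term memory more concrete, here is a minimal Python sketch of one way such a memory could curate historical visual context: it keeps screenshots for steps flagged as milestones plus a short window of the most recent steps, and drops the rest. This is an illustration under stated assumptions, not the OS-Symphony implementation. The names `Step`, `MilestoneMemory`, `recent_window`, `visual_context`, and `last_milestone` are hypothetical, and how a step gets flagged as a milestone is left to the caller.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    """One agent step. All fields are illustrative assumptions."""
    index: int
    action: str
    screenshot: str            # stand-in for an image reference
    is_milestone: bool = False


@dataclass
class MilestoneMemory:
    """Hypothetical memory: retains milestones and a short recent window."""
    recent_window: int = 2
    steps: List[Step] = field(default_factory=list)

    def record(self, action: str, screenshot: str,
               is_milestone: bool = False) -> Step:
        # Append a new step; the caller decides whether it is a milestone.
        step = Step(len(self.steps), action, screenshot, is_milestone)
        self.steps.append(step)
        return step

    def visual_context(self) -> List[Step]:
        # Keep every milestone plus the last `recent_window` steps, in order.
        cutoff = max(0, len(self.steps) - self.recent_window)
        return [s for s in self.steps if s.is_milestone or s.index >= cutoff]

    def last_milestone(self) -> Optional[Step]:
        # Most recent milestone, e.g. as a reference point for self-correction.
        for s in reversed(self.steps):
            if s.is_milestone:
                return s
        return None
```

With this shape, the visual context grows with the number of milestones rather than with the raw trajectory length, which is the property that matters for long-horizon tasks.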
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| GUI Navigation and Action | OS World (test) | Success Rate (OS) | 78.26 | 26 |
| OS GUI Agentic Task Execution | OSWorld 361 tasks (Verified) | Average Success Rate | 65.84 | 21 |
| GUI Automation | WindowsAgentArena | Success Rate (Office) | 54.76 | 11 |
| Operating System Task Automation | MacOSArena | Single App Score | 32.14 | 9 |
| Operating System Agent Control | WindowsAgentArena | Success Rate | 0.635 | 8 |