Anticipatory Planning for Multimodal AI Agents
About
Recent advances in multimodal agents have improved computer-use interaction and tool use, yet most existing systems remain reactive: they optimize actions in isolation without reasoning about future states or long-term goals. This limits planning coherence and prevents agents from reliably solving high-level, multi-step tasks. We introduce TraceR1, a two-stage reinforcement learning framework that explicitly trains anticipatory reasoning by forecasting short-horizon trajectories before execution. The first stage performs trajectory-level reinforcement learning with rewards that enforce global consistency across predicted action sequences. The second stage applies grounded reinforcement fine-tuning, using execution feedback from frozen tool agents to refine step-level accuracy and executability. TraceR1 is evaluated on seven benchmarks spanning online computer use, offline computer use, and multimodal tool-use reasoning, where it achieves substantial improvements in planning stability, execution robustness, and generalization over reactive and single-stage baselines. These results show that anticipatory trajectory reasoning is a key principle for building multimodal agents that can reason, plan, and act effectively in complex real-world environments.
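To make the two-stage recipe concrete, below is a minimal, hypothetical Python sketch: stage 1 optimizes a trajectory-level consistency reward over forecast action sequences, and stage 2 refines the same policy with step-level execution feedback from frozen tool agents. Every name here (`policy`, `consistency_reward`, `frozen_tool_agent`) and the crude REINFORCE-style update are illustrative assumptions drawn from the description above, not TraceR1's actual implementation.

```python
import random

# Hypothetical sketch of the two-stage training scheme described above.
# All names and rewards are placeholders, not TraceR1's actual API.

ACTIONS = ["click", "type", "scroll", "submit"]

def policy(state, theta):
    """Toy stochastic policy: samples an action, biased by per-action weights."""
    weights = [theta[a] for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

def consistency_reward(trajectory):
    """Stage-1 trajectory-level reward (placeholder): favors globally coherent
    sequences, approximated here as avoiding immediate repeats and ending
    with a task-completing action."""
    no_repeats = all(a != b for a, b in zip(trajectory, trajectory[1:]))
    return float(no_repeats) + float(trajectory[-1] == "submit")

def frozen_tool_agent(action):
    """Stage-2 execution feedback (placeholder): a frozen tool agent scores
    whether a single step was executable."""
    return 1.0 if action in ACTIONS else 0.0

def reinforce_update(theta, trajectory, reward, lr=0.1):
    """Crude REINFORCE-style update: upweight actions from rewarded trajectories."""
    for a in trajectory:
        theta[a] += lr * reward
    return theta

theta = {a: 1.0 for a in ACTIONS}

# Stage 1: trajectory-level RL on short-horizon forecast action sequences.
for episode in range(100):
    traj = [policy("obs", theta) for _ in range(4)]  # anticipate 4 steps ahead
    theta = reinforce_update(theta, traj, consistency_reward(traj))

# Stage 2: grounded fine-tuning with step-level execution feedback.
for episode in range(100):
    traj = [policy("obs", theta) for _ in range(4)]
    step_rewards = [frozen_tool_agent(a) for a in traj]  # frozen agents score each step
    theta = reinforce_update(theta, traj, sum(step_rewards) / len(step_rewards))
```

The ordering reflects the framework's stated design: optimizing a whole-trajectory reward first stabilizes global planning, after which grounded step-level feedback sharpens the executability of individual actions.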
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| GUI Agent Task | AndroidWorld | Success Rate | 64.8 | 136 |
| OS GUI Agentic Task Execution | OSWorld (361 tasks, Verified) | OS Success Rate | 41.2 | 43 |
| Step Execution | AndroidControl High | Step Success Rate | 75.3 | 15 |
| AI Agent Reasoning and Tool-use | GAIA | Level 1 Score | 55.9 | 15 |
| Step Execution | GUI-Odyssey | Step Success Rate | 88.2 | 14 |
| Multimodal Tool Use | GTA | Answer Accuracy | 56.7 | 10 |
| Step Execution | Multimodal-Mind2Web | Step Success Rate | 65.3 | 8 |