General Agentic Planning Through Simulative Reasoning with World Models

About

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.
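The contrast between reactive selection and simulative reasoning can be sketched as a small planning loop: propose candidate actions, predict the state each would lead to with a world model, score each predicted state against the goal, and act on the best one. The sketch below is illustrative only and is not SiRA's actual implementation. The names `plan_by_simulation`, `propose`, `world_model`, and `critic` are hypothetical, and the toy functions stand in for the LLM calls that would produce natural-language belief states in practice.

```python
from typing import Callable, List, Tuple

State = str   # assumption: a belief state is a plain natural-language string
Action = str  # assumption: an action is described in natural language


def plan_by_simulation(
    state: State,
    goal: str,
    propose: Callable[[State, str], List[Action]],
    world_model: Callable[[State, Action], State],
    critic: Callable[[State, str], float],
) -> Tuple[Action, State, float]:
    """Return the action whose simulated outcome scores best against the goal."""
    candidates = propose(state, goal)
    if not candidates:
        raise ValueError("no candidate actions proposed")
    best = None
    for action in candidates:
        predicted = world_model(state, action)  # simulate the consequence
        score = critic(predicted, goal)         # evaluate the predicted state
        if best is None or score > best[2]:     # ties keep the earlier candidate
            best = (action, predicted, score)
    return best


# Toy stand-ins (hypothetical) for the LLM-backed components.
def toy_propose(state: State, goal: str) -> List[Action]:
    return ["click 'Search flights'", "scroll down", "click 'Sign in'"]


def toy_world_model(state: State, action: Action) -> State:
    outcome = "a list of flights." if "Search flights" in action else "the same form."
    return f"{state} After {action}, the page shows {outcome}"


def toy_critic(predicted: State, goal: str) -> float:
    return 1.0 if "list of flights" in predicted else 0.0


action, predicted, score = plan_by_simulation(
    "The flight search form is filled in.",
    "find a flight",
    toy_propose,
    toy_world_model,
    toy_critic,
)
print(action, score)
```

A reactive baseline in this framing would return the first proposed action without calling `world_model` or `critic`; the simulative variant differs only in evaluating predicted future states before committing.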

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing • 2025

Related benchmarks

Task	Dataset	Result
Embodied Task Completion	EB-Habitat	Avg Success Rate: 48.4	63
Embodied Instruction Following	EB-ALFRED 1.0 (test)	Success Rate (Avg): 45.6	20
Constrained navigation	FlightQA	Accuracy: 32.2	5
General Instruction Following	WebArena random 100-sample	Success Rate: 23	3
Multi-hop information aggregation	FanOutQA	Accuracy: 29.8	3
