From Word to World: Can Large Language Models be Implicit Text-based World Models?

About

Agentic reinforcement learning increasingly relies on experience-driven scaling, yet real-world environments remain non-adaptive, limited in coverage, and difficult to scale. World models offer a potential way to improve learning efficiency through simulated experience, but it remains unclear whether large language models can reliably serve this role and under what conditions they meaningfully benefit agents. We study these questions in text-based environments, which provide a controlled setting to reinterpret language modeling as next-state prediction under interaction. We introduce a three-level framework for evaluating LLM-based world models: (i) fidelity and consistency, (ii) scalability and robustness, and (iii) agent utility. Across five representative environments, we find that sufficiently trained world models maintain coherent latent state, scale predictably with data and model size, and improve agent performance via action verification, synthetic trajectory generation, and warm-starting reinforcement learning. These gains, however, depend critically on behavioral coverage and environment complexity, delineating clear boundaries on when world modeling effectively supports agent learning.

Yixia Li, Hongru Wang, Jiahao Qiu, Zhenfei Yin, Dongdong Zhang, Cheng Qian, Zeping Li, Pony Ma, Guanhua Chen, Heng Ji • 2025

Related benchmarks

Task	Dataset	Result
Web Navigation and Shopping	Webshop	--	248
Instruction Following	ALFWorld (val seen)	Success Rate (SR): 82.14	65
Interactive Instruction Following	ALFWorld Unseen	Success Rate: 79.5	54
Science Simulation Task Completion	ScienceWorld Unseen	Success Rate: 54.3	28
Science Simulation Task Completion	ScienceWorld Seen	Success Rate: 59.27	28
Tool Use	StableToolBench	--	28
One-step next-observation prediction	ALFWorld (test)	Token F1: 89	16
One-step next-observation prediction	BabyAI (test)	Token F1: 93	16
One-step next-observation prediction	SciWorld (test)	Token F1: 96	16
One-step next-observation prediction	WebShop (test)	Token F1: 63	16

Showing 10 of 25 rows
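
The one-step next-observation rows above report a token-level F1 between the observation a world model predicts and the observation the real environment returns. The following is a minimal sketch of how such a score is commonly computed. It assumes lowercased whitespace tokenization and bag-of-tokens overlap; the paper's exact tokenization and normalization are not given here, and the function name `token_f1` is hypothetical.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Bag-of-tokens F1 between a predicted and a reference observation.

    Assumption: tokens are lowercased, whitespace-separated words. The
    benchmark's actual tokenizer may differ.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both sequences.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predicted = "You open the door"
    actual = "You open the drawer"
    print(token_f1(predicted, actual))
```

The function returns a value in [0, 1]; the table appears to report the same kind of score on a 0 to 100 scale, so a returned 0.89 would correspond to an entry of 89.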
