Self-Improving World Modelling with Latent Actions

About

Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_\theta(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_\phi(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), each using the log-probability assigned by the other, frozen model as its reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
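
The alternating reward scheme described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the integer toy environment, and the toy scoring functions are all hypothetical, and the advantage computation is the usual group-standardised form associated with GRPO, assumed here rather than taken from the paper. In phase 1 the frozen IDM scores next states sampled from the FWM; in phase 2 the frozen FWM scores latent actions sampled from the IDM.

```python
import math
from typing import Callable, List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardise rewards within one sampled group (GRPO-style advantages)."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def fwm_rewards(x, z, candidate_next_states, idm_log_prob: Callable) -> List[float]:
    """Phase 1: reward each sampled next state y with the frozen IDM's log Q(z | x, y)."""
    return [idm_log_prob(z, x, y) for y in candidate_next_states]


def idm_rewards(x, y, candidate_actions, fwm_log_prob: Callable) -> List[float]:
    """Phase 2: reward each sampled latent action z with the frozen FWM's log P(y | x, z)."""
    return [fwm_log_prob(y, x, z) for z in candidate_actions]


# Hypothetical toy environment: integer states, true dynamics y = x + z.
# These are unnormalised scores standing in for real model log-probabilities.
def toy_idm_log_prob(z: int, x: int, y: int) -> float:
    return -abs(z - (y - x))


def toy_fwm_log_prob(y: int, x: int, z: int) -> float:
    return -abs(y - (x + z))


if __name__ == "__main__":
    # Phase 1: the FWM proposed three next states for x=2 under latent action z=3.
    r_fwm = fwm_rewards(2, 3, [5, 6, 9], toy_idm_log_prob)
    print(r_fwm, group_relative_advantages(r_fwm))

    # Phase 2: the IDM proposed three latent actions for the transition 2 -> 5.
    r_idm = idm_rewards(2, 5, [1, 3, 4], toy_fwm_log_prob)
    print(r_idm, group_relative_advantages(r_idm))
```

In a real setup the two scoring callables would be replaced by log-probabilities from the frozen language or vision-language model, and the advantages would weight a policy-gradient update of the model being trained in that phase.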

Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti • 2026

Related benchmarks

Task	Dataset	Metric	Result
Visual World Modelling	Kubric	GPT-4o Score	7.45	18
Visual World Modelling	Action Genome	GPT-4o Score	3.58	18
Visual World Modelling	Something-Something	GPT-4o Score	3.96	18
Visual World Modelling	AURORA-BENCH Average	GPT-4o Score	5.06	18
Visual World Modelling	MagicBrush	GPT-4o Score	6.62	18
Visual World Modelling	WhatsUp	GPT-4o Score	4.46	18
Visual Dynamics Prediction	ByteMorph	Camera Zoom	57.37	9
Long-Horizon World Modelling	WorldPredictionBench v1 (test)	COIN	2.61	9
API Execution Simulation	StableToolBench	ID High Success Rate	16.47	8
Textual Environment Interaction	Mind2Web	BS	92.44	8
