
Token Predictors Are Not Planners: Building Physically Grounded Causal Reasoners

About

Current benchmarks for embodied vision-language planning often favor linguistic next-token prediction over physically grounded next-state reasoning. This rewards models that mimic statistical language priors rather than track causal dependencies, reducing physical planning to shallow sequence modeling. We argue that reliable physical autonomy requires a shift from linguistically grounded token prediction toward physically grounded causal reasoning. To this end, we introduce Causal-Plan-Bench, a high-fidelity diagnostic suite curated through multi-stage verification to evaluate embodied planning across four causal dimensions. We also construct Causal-Plan-1M, a million-scale corpus of explicit reasoning traces produced by a four-stage annotation pipeline over egocentric videos. Extensive evaluation shows that leading models still struggle to demonstrate genuine physical agency, with Gemini 3 Pro reaching only 38.18 on our benchmark. In contrast, our training recipe enables Causal Planner, built on Qwen3-VL-8B, to internalize physical logic for more accurate next-state estimation. The model achieves strong in-domain performance and cross-benchmark generalization, and reveals a Causal Scaling Law: scaling causal training data to one million instances yields a 36.3% relative gain, from 33.22 to 45.28. Overall, our work provides a concrete step toward turning agents from superficial token predictors into physically grounded causal reasoners.
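The relative gain quoted above can be checked directly from the two reported scores. The sketch below is only a sanity check of that arithmetic; the helper name `relative_gain` is hypothetical and not part of the authors' code.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the relative gain of `improved` over `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be nonzero")
    return (improved - baseline) / baseline * 100.0


# Scores reported in the abstract: 33.22 before scaling, 45.28 at one million instances.
gain = relative_gain(33.22, 45.28)
print(f"{gain:.1f}%")  # prints "36.3%"
```

The absolute improvement is 12.06 points; dividing by the 33.22 baseline gives the 36.3% relative figure.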

Zheng Lu, Mingqi Gao, Qinlei Xie, Wanqi Zhong, Hanwen Cui, Heng Cao, Zirui Song, Yifan Yang, Chong Luo, Bei Liu, Yiming Li • 2026

Related benchmarks

Task | Dataset | Result | Rank
Embodied Planning | Causal-Plan-Bench in-domain | Overall Success Rate: 45.28 | 16
Next-Step-Prediction Style Planning | RoboVQA | Performance Score: 63.43 | 16
Next-Step-Prediction Style Planning | EgoPlan-Bench 2 | Overall Performance Score: 45.32 | 16
Next-Step-Prediction Style Planning | Cosmos Reason | Performance: 63.3 | 16
