FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models

About

Many Vision-Language-Action (VLA) models are built upon an internal world model trained via next-frame prediction "$v_t \rightarrow v_{t+1}$". However, this paradigm attempts to predict the future frame's appearance directly, without explicitly reasoning about the underlying dynamics. This lack of an explicit motion reasoning step often leads to physically implausible visual forecasts and inefficient policy learning. To address this limitation, we introduce the Visual Chain of Thought (Visual CoT), a paradigm that compels the model to first reason about motion dynamics before generating the future frame. We instantiate this paradigm by proposing FlowVLA, an autoregressive Transformer that explicitly materializes this reasoning process as "$v_t \rightarrow f_t \rightarrow v_{t+1}$", where $f_t$ is an intermediate optical flow prediction that inherently encodes motion. By forcing the model to first follow the motion plan encoded by $f_t$, this process inherently aligns the pre-training objective of dynamics prediction with the downstream task of action generation. We conduct experiments on challenging robotic manipulation benchmarks, as well as real-robot evaluations. FlowVLA not only generates more coherent and physically plausible visual predictions, but also achieves state-of-the-art policy performance with substantially improved sample efficiency, pointing toward a more principled foundation for world modeling in VLAs. Project page: https://irpn-lab.github.io/FlowVLA/
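
To make the difference between the two orderings in the abstract concrete, here is a minimal sketch of how a training sequence could be laid out for each paradigm. It only illustrates the ordering "$v_t \rightarrow f_t \rightarrow v_{t+1}$"; the function names, the string placeholders standing in for tokenized frames and flows, and the one-flow-per-frame-pair assumption are all hypothetical and are not taken from the FlowVLA code.

```python
from typing import List, Sequence, Tuple

Item = Tuple[str, str]  # (kind, placeholder payload); hypothetical representation


def next_frame_sequence(frames: Sequence[str]) -> List[Item]:
    """Baseline ordering v_0, v_1, ..., v_T (plain next-frame prediction)."""
    return [("frame", frame) for frame in frames]


def visual_cot_sequence(frames: Sequence[str], flows: Sequence[str]) -> List[Item]:
    """Interleaved ordering v_0, f_0, v_1, f_1, ..., v_T.

    Assumes flows[t] describes the motion between frames[t] and frames[t + 1],
    so there must be exactly one flow per consecutive frame pair.
    """
    if len(flows) != len(frames) - 1:
        raise ValueError("expected exactly one flow per consecutive frame pair")
    sequence: List[Item] = []
    for t, flow in enumerate(flows):
        sequence.append(("frame", frames[t]))  # current observation v_t
        sequence.append(("flow", flow))        # intermediate motion f_t
    sequence.append(("frame", frames[-1]))     # final frame v_T
    return sequence


if __name__ == "__main__":
    frames = ["v0", "v1", "v2"]
    flows = ["f0", "f1"]
    print(next_frame_sequence(frames))
    print(visual_cot_sequence(frames, flows))
```

In an autoregressive model trained on the interleaved layout, each future frame is generated only after its flow has been generated, which is the "reason about motion first" step the abstract describes. How frames and flows are actually tokenized is not specified here.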

Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, Haoang Li • 2025

Related benchmarks

Task	Dataset	Metric	Result
Robot Manipulation	LIBERO	Object Achievement	95	1025
Robotic Manipulation	LIBERO	Spatial Success Rate	93.2	570
Robot Manipulation	LIBERO (test)	Average Success Rate	88.1	237
Robotic Manipulation	LIBERO	Long-horizon Success Rate	72.6	165
Robot Manipulation	SimplerEnv WidowX	Overall Success Rate	74	123
Robotic Manipulation	LIBERO v1 (test)	Average Success Rate	88.1	118
Robotic Manipulation	LIBERO Spatial Object Goal Long	Overall Success Rate (Long)	72.6	91
Robot Manipulation	LIBERO	Spatial Success Rate	93.2	46
Language-conditioned manipulation	LIBERO	Spatial Success Rate	93.2	18
