FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

About

Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically plausible priors such as future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, Ning Guo • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Open-loop planning	nuScenes (val)	L2 Error (3s)	0.56	225
Autonomous Driving Planning	NAVSIM v1	NC	98.2	126
Open-loop planning	nuScenes	L2 Error (Avg)	0.45	121
Autonomous Driving Planning	NAVSIM v1 (test)	NC	98.2	118
Open-loop planning	nuScenes v1.0 (val)	L2 (1s)	0.28	71
Trajectory Planning	nuScenes	L2 Error (m) (1s)	0.14	58
Planning	NAVSIM v1	PDMS	85.1	23
Closed-loop Planning	NAVSIM v1 (test)	PDMS	85.1	20
Frame prediction	nuScenes	FID	10.1	16
Autonomous Driving Planning	Nav. (test)	NC	98.2	14

Showing 10 of 15 rows
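
For readers unfamiliar with the L2 Error rows above, the following is a minimal sketch of how an open-loop L2 displacement error can be computed from predicted and ground-truth waypoints. It is not taken from the FSDrive code. The function names, the 2 Hz waypoint rate, and the `cumulative` switch are assumptions made for illustration. Published nuScenes results differ in whether the error at a horizon is the displacement at that timestep or the mean over all waypoints up to it, which is one reason L2 values for the same horizon can differ between rows, so the sketch exposes both.

```python
import math

def l2_errors(pred, gt):
    """Per-waypoint Euclidean distance between predicted and ground-truth (x, y) points."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of waypoints")
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]

def l2_at_horizons(pred, gt, hz=2.0, horizons=(1.0, 2.0, 3.0), cumulative=False):
    """L2 error at each horizon in seconds.

    hz is the assumed waypoint rate (illustrative default: 2 waypoints per second).
    cumulative=False reports the displacement at the horizon's own waypoint;
    cumulative=True reports the mean displacement over all waypoints up to it.
    """
    errs = l2_errors(pred, gt)
    result = {}
    for h in horizons:
        n = int(round(h * hz))  # number of waypoints up to and including horizon h
        if n < 1 or n > len(errs):
            raise ValueError(f"horizon {h}s is outside the trajectory")
        result[h] = sum(errs[:n]) / n if cumulative else errs[n - 1]
    return result

# Hypothetical 3-second trajectory at 2 Hz: straight ahead, with lateral drift in the prediction.
gt = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0), (0.0, 5.0), (0.0, 6.0)]
pred = [(0.0, 1.0), (0.5, 2.0), (0.0, 3.0), (1.0, 4.0), (0.0, 5.0), (1.5, 6.0)]

per_step = l2_at_horizons(pred, gt)                    # {1.0: 0.5, 2.0: 1.0, 3.0: 1.5}
averaged = l2_at_horizons(pred, gt, cumulative=True)   # {1.0: 0.25, 2.0: 0.375, 3.0: 0.5}
avg_l2 = sum(per_step.values()) / len(per_step)        # 1.0
```

An "Avg" figure such as the one in the table is typically the mean of the 1s, 2s, and 3s values under whichever convention a given benchmark row uses.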
