FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

About

Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.
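The two-stage flow described above (imagine a future frame, then plan conditioned on it) and the shared text-plus-visual vocabulary can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the vocabulary sizes, the offset scheme for visual token ids, the `Model` callable, and the function names are hypothetical and are not taken from the FSDrive code.

```python
from typing import Callable, List

# Illustrative sizes only; these are NOT FSDrive's actual values.
TEXT_VOCAB_SIZE = 32_000      # hypothetical size of the original text vocabulary
VISUAL_CODEBOOK_SIZE = 8_192  # hypothetical size of an image-tokenizer codebook

# Stand-in for an autoregressive VLA: maps a token-id prompt to generated token ids.
Model = Callable[[List[int]], List[int]]


def visual_to_token_id(code: int) -> int:
    """Place an image-codebook index after the text ids in one shared id space."""
    if not 0 <= code < VISUAL_CODEBOOK_SIZE:
        raise ValueError(f"visual code out of range: {code}")
    return TEXT_VOCAB_SIZE + code


def is_visual_token(token_id: int) -> bool:
    """True if the id falls in the (assumed) visual part of the vocabulary."""
    return TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE


def plan_with_visual_cot(
    model: Model,
    observation: List[int],
    future_prompt: List[int],
    plan_prompt: List[int],
) -> List[int]:
    """Run the same model twice: once to imagine, once to plan."""
    # Stage 1 (world-model role): generate tokens for the imagined future frame.
    visual_cot = model(observation + future_prompt)
    # Stage 2 (inverse-dynamics role): plan from the observation plus the visual CoT.
    return model(observation + visual_cot + plan_prompt)
```

The point of the sketch is only that one model and one token-id space serve both roles; the real system's tokenizer, prompts, and decoding are not specified here.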

Shuang Zeng, Xinyuan Chang, Mengwei Xie, Xinran Liu, Yifan Bai, Zheng Pan, Mu Xu, Xing Wei, Ning Guo • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Open-loop planning | nuScenes v1.0 (val) | L2 (1s): 0.28 | 59 |
| Trajectory Planning | Unified Evaluation Settings Autonomous Driving (test) | ADE: 5.02 | 14 |
| Autonomous driving reasoning (cross-view risk object perception, action prediction, and planning) | DriveLM | Accuracy: 71.77 | 10 |
| Frame prediction | nuScenes | FID: 10.1 | 8 |
| Future frames generation | Bench2Drive (test) | FID: 9.3 | 8 |
| Graph Visual Question Answering | DriveLM GVQA | Accuracy: 72 | 7 |
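For readers unfamiliar with the trajectory metrics listed above, the following is a minimal sketch of how ADE and an L2 error at a time horizon are commonly computed from predicted and ground-truth waypoints. The function names, the 2 Hz waypoint rate, and the `average_up_to_horizon` switch are assumptions for illustration; benchmarks differ on whether "L2 (1s)" means the error at the 1 s waypoint or the mean error up to it, and this listing does not say which convention applies.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]  # (x, y) waypoint, typically in meters


def displacement_errors(pred: Sequence[Point], gt: Sequence[Point]) -> List[float]:
    """Per-waypoint Euclidean distance between prediction and ground truth."""
    if len(pred) != len(gt) or len(pred) == 0:
        raise ValueError("trajectories must be non-empty and of equal length")
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average displacement error: mean distance over all waypoints."""
    errs = displacement_errors(pred, gt)
    return sum(errs) / len(errs)


def l2_at_horizon(
    pred: Sequence[Point],
    gt: Sequence[Point],
    horizon_s: float,
    hz: float = 2.0,                      # assumed waypoint rate
    average_up_to_horizon: bool = False,  # assumed switch between two conventions
) -> float:
    """L2 error at (or averaged up to) the waypoint reached after horizon_s seconds."""
    errs = displacement_errors(pred, gt)
    steps = int(round(horizon_s * hz))
    if steps < 1 or steps > len(errs):
        raise ValueError("horizon lies outside the trajectory")
    if average_up_to_horizon:
        return sum(errs[:steps]) / steps
    return errs[steps - 1]


# Usage with made-up waypoints sampled every 0.5 s:
gt_traj = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0)]
pred_traj = [(1.0, 0.0), (2.0, 1.0), (3.0, 0.0), (4.0, 2.0)]
print(ade(pred_traj, gt_traj))
print(l2_at_horizon(pred_traj, gt_traj, horizon_s=1.0))
```

Lower values are better for both metrics.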
