FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
About
Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.
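The abstract says the pre-training paradigm expands the model's vocabulary with visual tokens so that one autoregressive model can emit both text and image content. One common way to do this is to append the image codebook after the text vocabulary. The sketch below illustrates that idea only. The sizes `TEXT_VOCAB_SIZE` and `VISUAL_CODEBOOK_SIZE` and all function names are hypothetical, not taken from the FSDrive code.

```python
from typing import List, Optional, Sequence

# Hypothetical sizes, chosen only for illustration (not from the paper).
TEXT_VOCAB_SIZE = 32000       # assumed size of the original text vocabulary
VISUAL_CODEBOOK_SIZE = 8192   # assumed number of discrete image codes


def visual_code_to_token_id(code: int) -> int:
    """Map an image-codebook index to an id in the expanded vocabulary."""
    if not 0 <= code < VISUAL_CODEBOOK_SIZE:
        raise ValueError(f"visual code {code} is outside the codebook")
    # Visual tokens occupy the id range directly after the text tokens.
    return TEXT_VOCAB_SIZE + code


def token_id_to_visual_code(token_id: int) -> Optional[int]:
    """Return the codebook index for a visual token, or None for a text token."""
    if TEXT_VOCAB_SIZE <= token_id < TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE:
        return token_id - TEXT_VOCAB_SIZE
    return None


def build_sequence(text_ids: Sequence[int], visual_codes: Sequence[int]) -> List[int]:
    """Concatenate text ids and mapped visual codes into one token sequence."""
    return list(text_ids) + [visual_code_to_token_id(c) for c in visual_codes]
```

With this layout, a single output head over `TEXT_VOCAB_SIZE + VISUAL_CODEBOOK_SIZE` entries can be trained on VQA targets (text ids) and future-frame targets (visual ids) in the same sequence format.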
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open-loop planning | nuScenes v1.0 (val) | L2 (1s) | 0.28 | 59 |
| Trajectory Planning | Unified Evaluation Settings Autonomous Driving (test) | ADE | 5.02 | 14 |
| Autonomous driving reasoning (cross-view risk object perception, action prediction, and planning) | DriveLM | Accuracy | 71.77 | 10 |
| Frame prediction | nuScenes | FID | 10.1 | 8 |
| Future frames generation | Bench2Drive (test) | FID | 9.3 | 8 |
| Graph Visual Question Answering | DriveLM GVQA | Accuracy | 72 | 7 |
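The planning rows above are reported as displacement errors between predicted and ground-truth waypoints. As a minimal sketch of how such numbers are typically computed, the code below evaluates the L2 error at a given horizon and the average displacement error (ADE). It assumes 2D waypoints sampled at a fixed rate (`hz=2.0` is an assumption). Benchmarks differ on whether the error at a horizon is taken at that single waypoint or averaged over all waypoints up to it, and this sketch uses the single-waypoint convention. The function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def displacement_errors(pred: Sequence[Point], gt: Sequence[Point]) -> List[float]:
    """Per-waypoint Euclidean distance between predicted and true positions."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same number of waypoints")
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def l2_at_horizon(pred: Sequence[Point], gt: Sequence[Point],
                  horizon_s: float, hz: float = 2.0) -> float:
    """L2 error at the waypoint that lies horizon_s seconds ahead.

    Assumes the first waypoint is 1/hz seconds into the future.
    """
    idx = int(round(horizon_s * hz)) - 1
    if not 0 <= idx < len(pred):
        raise ValueError("horizon is outside the trajectory")
    return displacement_errors(pred, gt)[idx]


def ade(pred: Sequence[Point], gt: Sequence[Point]) -> float:
    """Average displacement error over all waypoints."""
    errs = displacement_errors(pred, gt)
    return sum(errs) / len(errs)
```

For example, with `pred = [(0.0, 0.0), (3.0, 4.0)]` and `gt = [(0.0, 0.0), (0.0, 0.0)]`, the per-waypoint errors are 0 and 5, so `l2_at_horizon(pred, gt, 1.0)` returns 5.0 and `ade(pred, gt)` returns 2.5.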