InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving
About
Conventional end-to-end autonomous driving methods often rely on explicit global scene representations, which typically consist of 3D object detection, online mapping, and motion prediction. In contrast, human drivers selectively attend to task-relevant regions and implicitly reason over the broader traffic context. Motivated by this observation, we introduce InsightDrive, a lightweight end-to-end autonomous driving framework. Unlike approaches that directly embed large language models (LLMs), InsightDrive proposes an Insight scene representation that jointly models an attention-centric explicit representation and a reasoning-centric implicit representation, so that scene understanding aligns more closely with human cognitive patterns for trajectory planning. To this end, we employ Chain-of-Thought (CoT) instructions to model human driving cognition and design a task-level Mixture-of-Experts (MoE) adapter that injects this knowledge into the driving model at negligible parameter cost. We further condition the planner on both the explicit and implicit scene representations and adopt a diffusion-based generative policy, which produces robust trajectory predictions and decisions. The overall framework establishes a knowledge distillation pipeline that transfers human driving knowledge to LLMs and subsequently to onboard models. Extensive experiments on the nuScenes and Navsim benchmarks demonstrate that InsightDrive achieves significant improvements over conventional scene representation approaches.
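To make the adapter idea concrete, below is a minimal sketch of a task-level MoE adapter that injects distilled knowledge into scene features through small bottleneck experts with a per-task router. This is not the authors' implementation; the class name, dimensions, routing scheme, and residual injection are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the official InsightDrive code) of a
# task-level Mixture-of-Experts adapter: a few small bottleneck experts are
# mixed by a routing gate and added residually to the scene features, so the
# extra parameter count stays negligible next to the backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskMoEAdapter(nn.Module):
    """Lightweight bottleneck experts gated by a routing vector (illustrative)."""

    def __init__(self, dim: int = 256, num_experts: int = 4, bottleneck: int = 32):
        super().__init__()
        # Each expert is a small down-project / up-project MLP.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(num_experts)
        )
        # The router scores experts from the pooled scene feature.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, scene_tokens: torch.Tensor) -> torch.Tensor:
        # scene_tokens: (batch, num_tokens, dim) scene features.
        gate = F.softmax(self.router(scene_tokens.mean(dim=1)), dim=-1)        # (B, E)
        expert_out = torch.stack([e(scene_tokens) for e in self.experts], 1)   # (B, E, N, D)
        mixed = torch.einsum("be,bend->bnd", gate, expert_out)                 # (B, N, D)
        # Residual injection leaves the backbone representation intact.
        return scene_tokens + mixed


if __name__ == "__main__":
    adapter = TaskMoEAdapter()
    tokens = torch.randn(2, 100, 256)
    print(adapter(tokens).shape)  # torch.Size([2, 100, 256])
```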
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Trajectory Planning | nuScenes | ST-P3 L2 Error (1s) | 0.23 | 12 |
| Motion Planning | nuScenes | ST-P3 Collision (1s) | 0.09 | 11 |
| End-to-end Motion Planning | nuScenes v1.0 (val) | ST-P3 Collision Rate (1s) | 0.09 | 9 |