InsightDrive: Insight Scene Representation for End-to-End Autonomous Driving
About
Conventional end-to-end autonomous driving methods often rely on explicit global scene representations, which typically consist of 3D object detection, online mapping, and motion prediction. In contrast, human drivers selectively attend to task-relevant regions and implicitly reason over the broader traffic context. Motivated by this observation, we introduce InsightDrive, a lightweight end-to-end autonomous driving framework. Unlike approaches that directly embed large language models (LLMs), InsightDrive proposes an Insight scene representation that jointly models an attention-centric explicit representation and a reasoning-centric implicit representation, so that scene understanding aligns more closely with human cognitive patterns for trajectory planning. To this end, we employ Chain-of-Thought (CoT) instructions to model human driving cognition and design a task-level Mixture-of-Experts (MoE) adapter that injects this knowledge into the driving model at negligible parameter cost. We further condition the planner on both the explicit and implicit scene representations and adopt a diffusion-based generative policy, which produces robust trajectory predictions and decisions. The overall framework establishes a knowledge distillation pipeline that transfers human driving knowledge to LLMs and subsequently to onboard models. Extensive experiments on the nuScenes and Navsim benchmarks demonstrate that InsightDrive achieves significant improvements over conventional scene representation approaches.
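To make the adapter idea concrete, below is a minimal sketch of a task-level MoE adapter that injects distilled knowledge into scene features through small bottleneck experts with a per-task router. This is not the authors' implementation; the class name, dimensions, routing scheme, and residual injection are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the official InsightDrive code) of a
# task-level Mixture-of-Experts adapter: a few small bottleneck experts are
# mixed by a routing gate and added residually to the scene features, so the
# extra parameter count stays negligible next to the backbone.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TaskMoEAdapter(nn.Module):
    """Lightweight bottleneck experts gated by a routing vector (illustrative)."""

    def __init__(self, dim: int = 256, num_experts: int = 4, bottleneck: int = 32):
        super().__init__()
        # Each expert is a small down-project / up-project MLP.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(), nn.Linear(bottleneck, dim))
            for _ in range(num_experts)
        )
        # The router scores experts from the pooled scene feature.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, scene_tokens: torch.Tensor) -> torch.Tensor:
        # scene_tokens: (batch, num_tokens, dim) scene features.
        gate = F.softmax(self.router(scene_tokens.mean(dim=1)), dim=-1)        # (B, E)
        expert_out = torch.stack([e(scene_tokens) for e in self.experts], 1)   # (B, E, N, D)
        mixed = torch.einsum("be,bend->bnd", gate, expert_out)                 # (B, N, D)
        # Residual injection leaves the backbone representation intact.
        return scene_tokens + mixed


if __name__ == "__main__":
    adapter = TaskMoEAdapter()
    tokens = torch.randn(2, 100, 256)
    print(adapter(tokens).shape)  # torch.Size([2, 100, 256])
```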
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Trajectory Planning | nuScenes | ST-P3 L2 Error (1s) | 0.23 | 12 |
| Motion Planning | nuScenes | ST-P3 Collision (1s) | 0.09 | 11 |
| End-to-end Motion Planning | nuScenes v1.0 (val) | ST-P3 Collision Rate (1s) | 0.09 | 9 |