VLM-AD: End-to-End Autonomous Driving through Vision-Language Model Supervision

About

Human drivers rely on commonsense reasoning to navigate diverse and dynamic real-world scenarios. Existing end-to-end (E2E) autonomous driving (AD) models are typically optimized to mimic driving patterns observed in data, without capturing the underlying reasoning processes. This limitation constrains their ability to handle challenging driving scenarios. To close this gap, we propose VLM-AD, a method that leverages vision-language models (VLMs) as teachers to enhance training by providing additional supervision that incorporates unstructured reasoning information and structured action labels. Such supervision enhances the model's ability to learn richer feature representations that capture the rationale behind driving patterns. Importantly, our method does not require a VLM during inference, making it practical for real-time deployment. When integrated with state-of-the-art methods, VLM-AD achieves significant improvements in planning accuracy and reduced collision rates on the nuScenes dataset. It further improves route completion and driving scores under closed-loop evaluation, demonstrating its effectiveness in long-horizon, interactive driving scenarios and its potential for safe and reliable real-world deployment.

Yi Xu, Yuxin Hu, Zaiwei Zhang, Gregory P. Meyer, Siva Karthik Mustikovela, Siddhartha Srinivasa, Eric M. Wolff, Xin Huang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Reinforcement Learning | DMControl Cartpole, Swingup | Episode Return: 776 | 16 |
| Visual Reinforcement Learning | DMControl Finger, Spin | Episode Return: 815 | 16 |
| Visual Reinforcement Learning | DMControl Cheetah Run | Episode Return: 255 | 16 |
| Visual Reinforcement Learning | DMControl Ball in cup, Catch | Episode Return: 751 | 16 |
| Visual Reinforcement Learning | DMControl Reacher Easy | Episode Return: 224 | 16 |
| Visual Reinforcement Learning | DMControl Walker Walk | Episode Return: 209 | 16 |
| Autonomous Driving | CARLA (#HW) | Error Rate: 113 | 15 |
| Visual Reinforcement Learning | CARLA (#GP scenario) | ER: 127 | 15 |
| Visual Reinforcement Learning | CarRacing v0 (test) | Environment Reward: 7.28e+3 | 11 |
| Planning | nuScenes | L2 Error (1s): 0.3 | 9 |
Showing 10 of 16 rows
