DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
About
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy DriveVLM-Dual on a production vehicle, verifying its effectiveness in real-world autonomous driving environments.
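The dual design described above can be sketched as a toy pipeline: a slow VLM branch proposes a coarse trajectory, and a fast classical branch refines it. This is a minimal illustration only; the function names, the fixed waypoint proposal, and the linear-interpolation "refinement" are all hypothetical stand-ins, not the paper's actual implementation.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in the ego frame, metres


def vlm_coarse_plan(scene_tokens: List[str]) -> List[Waypoint]:
    # Stand-in for the slow branch: in DriveVLM this would be a VLM
    # producing a scene description, analysis, and a low-frequency
    # coarse trajectory. Here we return a fixed proposal for illustration.
    return [(0.0, 0.0), (2.0, 0.5), (4.0, 2.0), (6.0, 2.5)]


def classical_refine(coarse: List[Waypoint]) -> List[Waypoint]:
    # Stand-in for the fast branch: densify the coarse plan by linear
    # interpolation (a real planner would handle dynamics, collision
    # avoidance, and high-frequency replanning).
    refined: List[Waypoint] = []
    for (x0, y0), (x1, y1) in zip(coarse, coarse[1:]):
        refined.append((x0, y0))
        refined.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
    refined.append(coarse[-1])
    return refined


def drivevlm_dual_step(scene_tokens: List[str]) -> List[Waypoint]:
    # Dual pipeline: coarse proposal from the VLM branch, refinement
    # from the traditional planning stack.
    return classical_refine(vlm_coarse_plan(scene_tokens))
```

The key design point the sketch mirrors is asynchrony: the VLM branch can run at low frequency while the classical branch keeps the control loop fast and spatially precise.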
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open-loop planning | nuScenes v1.0 (val) | L2 (1s) | 0.15 | 59 |
| Planning | nuScenes v1.0-trainval (val) | ST-P3 L2 Error (1s) | 0.15 | 39 |
| Open-loop trajectory prediction | nuScenes v1.0 (test) | L2 Error (1s) | 0.15 | 29 |
| Open-loop planning | nuScenes v1.0 (test) | L2 Error (1s) | 0.15 | 28 |
| Open-loop planning | nuScenes | L2 Error (1s) | 0.15 | 20 |
| Trajectory planning | nuScenes | ST-P3 L2 Error (1s) | 0.18 | 12 |
| Motion planning | nuScenes | ST-P3 Collision (1s) | 0.10 | 11 |
| End-to-end motion planning | nuScenes v1.0 (val) | ST-P3 Collision Rate (1s) | 0.10 | 9 |
| Trajectory prediction | RoboDriveBench 1.0 (test) | L2 Error (Clean) | 0.69 | 7 |
| Collision robustness evaluation | RoboDriveBench | Clean Avg Collision | 0.29 | 7 |
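The L2 error reported in the table measures displacement between predicted and ground-truth waypoints up to a time horizon. As a minimal sketch, the function below averages point-wise Euclidean distances over the horizon; note that evaluation protocols differ (e.g. the ST-P3 convention averages over the horizon, while others use only the endpoint), and the step count per second depends on the dataset's sampling rate.

```python
import math
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y), metres


def l2_error(pred: List[Waypoint], gt: List[Waypoint], horizon_steps: int) -> float:
    # Average Euclidean displacement between predicted and ground-truth
    # waypoints over the first `horizon_steps` trajectory points.
    dists = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred[:horizon_steps], gt[:horizon_steps])
    ]
    return sum(dists) / len(dists)
```

For example, a prediction that is exact at the first waypoint but one metre off laterally at the second yields an average L2 error of 0.5 m over a two-step horizon.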