DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models
About
A primary hurdle of autonomous driving in urban environments is understanding complex and long-tail scenarios, such as challenging road conditions and delicate human behaviors. We introduce DriveVLM, an autonomous driving system that leverages Vision-Language Models (VLMs) for enhanced scene understanding and planning capabilities. DriveVLM integrates a unique combination of reasoning modules for scene description, scene analysis, and hierarchical planning. Furthermore, recognizing the limitations of VLMs in spatial reasoning and their heavy computational requirements, we propose DriveVLM-Dual, a hybrid system that synergizes the strengths of DriveVLM with the traditional autonomous driving pipeline. Experiments on both the nuScenes dataset and our SUP-AD dataset demonstrate the efficacy of DriveVLM and DriveVLM-Dual in handling complex and unpredictable driving conditions. Finally, we deploy DriveVLM-Dual on a production vehicle, verifying its effectiveness in real-world autonomous driving environments.
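The dual design described above can be sketched as a toy pipeline: a slow VLM branch proposes a coarse trajectory, and a fast classical branch refines it. This is a minimal illustration only; the function names, the fixed waypoint proposal, and the linear-interpolation "refinement" are all hypothetical stand-ins, not the paper's actual implementation.

```python
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y) in the ego frame, metres


def vlm_coarse_plan(scene_tokens: List[str]) -> List[Waypoint]:
    # Stand-in for the slow branch: in DriveVLM this would be a VLM
    # producing a scene description, analysis, and a low-frequency
    # coarse trajectory. Here we return a fixed proposal for illustration.
    return [(0.0, 0.0), (2.0, 0.5), (4.0, 2.0), (6.0, 2.5)]


def classical_refine(coarse: List[Waypoint]) -> List[Waypoint]:
    # Stand-in for the fast branch: densify the coarse plan by linear
    # interpolation (a real planner would handle dynamics, collision
    # avoidance, and high-frequency replanning).
    refined: List[Waypoint] = []
    for (x0, y0), (x1, y1) in zip(coarse, coarse[1:]):
        refined.append((x0, y0))
        refined.append(((x0 + x1) / 2.0, (y0 + y1) / 2.0))
    refined.append(coarse[-1])
    return refined


def drivevlm_dual_step(scene_tokens: List[str]) -> List[Waypoint]:
    # Dual pipeline: coarse proposal from the VLM branch, refinement
    # from the traditional planning stack.
    return classical_refine(vlm_coarse_plan(scene_tokens))
```

The key design point the sketch mirrors is asynchrony: the VLM branch can run at low frequency while the classical branch keeps the control loop fast and spatially precise.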
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open-loop planning | nuScenes v1.0 (val) | L2 (1s) | 0.15 | 59 |
| Planning | nuScenes v1.0-trainval (val) | ST-P3 L2 Error (1s) | 0.15 | 39 |
| Open-loop trajectory prediction | nuScenes v1.0 (test) | L2 Error (1s) | 0.15 | 29 |
| Open-loop planning | nuScenes v1.0 (test) | L2 Error (1s) | 0.15 | 28 |
| Open-loop planning | nuScenes | L2 Error (1s) | 0.15 | 20 |
| Trajectory planning | nuScenes | ST-P3 L2 Error (1s) | 0.18 | 12 |
| Motion planning | nuScenes | ST-P3 Collision (1s) | 0.10 | 11 |
| End-to-end motion planning | nuScenes v1.0 (val) | ST-P3 Collision Rate (1s) | 0.10 | 9 |
| Trajectory prediction | RoboDriveBench 1.0 (test) | L2 Error (Clean) | 0.69 | 7 |
| Collision robustness evaluation | RoboDriveBench | Clean Avg Collision | 0.29 | 7 |
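The L2 error reported in the table measures displacement between predicted and ground-truth waypoints up to a time horizon. As a minimal sketch, the function below averages point-wise Euclidean distances over the horizon; note that evaluation protocols differ (e.g. the ST-P3 convention averages over the horizon, while others use only the endpoint), and the step count per second depends on the dataset's sampling rate.

```python
import math
from typing import List, Tuple

Waypoint = Tuple[float, float]  # (x, y), metres


def l2_error(pred: List[Waypoint], gt: List[Waypoint], horizon_steps: int) -> float:
    # Average Euclidean displacement between predicted and ground-truth
    # waypoints over the first `horizon_steps` trajectory points.
    dists = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(pred[:horizon_steps], gt[:horizon_steps])
    ]
    return sum(dists) / len(dists)
```

For example, a prediction that is exact at the first waypoint but one metre off laterally at the second yields an average L2 error of 0.5 m over a two-step horizon.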