
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model

About

We present OpenDriveVLA, a Vision-Language-Action model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions by leveraging multimodal inputs, including 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision-language alignment process, projecting both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent-environment-ego interaction modeling into the autoregressive decoding process, enabling the model to capture fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results across open-loop trajectory planning and driving-related question answering tasks. Qualitative analyses further illustrate its capability to follow high-level driving commands and generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
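
The idea of projecting 2D and 3D visual tokens into one semantic space can be pictured with a small sketch. This is not the authors' implementation: the token widths, the embedding width `EMBED_DIM`, the random weights standing in for learned projectors, and the helper names `make_linear` and `project` are all hypothetical, chosen only to show two token streams of different widths ending up in a single sequence of equal-width vectors.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random (out_dim x in_dim) matrix standing in for a learned projector."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def project(tokens, weight):
    """Apply the same linear map to every token vector."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weight] for tok in tokens]


rng = random.Random(0)
EMBED_DIM = 8  # hypothetical width of the shared (language) embedding space

# Hypothetical inputs: three 4-dim "2D" tokens and two 6-dim "3D" tokens.
tokens_2d = [[rng.random() for _ in range(4)] for _ in range(3)]
tokens_3d = [[rng.random() for _ in range(6)] for _ in range(2)]

# One projector per token type, both mapping into the same EMBED_DIM space.
proj_2d = make_linear(4, EMBED_DIM, rng)
proj_3d = make_linear(6, EMBED_DIM, rng)

# After projection the two streams can be concatenated into one sequence.
unified = project(tokens_2d, proj_2d) + project(tokens_3d, proj_3d)
print(len(unified), len(unified[0]))
```

In a real model the projectors would be trained so that the projected tokens align with the language embeddings; here they are random and only the shapes are meaningful.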

Xingcheng Zhou, Xuyuan Han, Feng Yang, Yunpu Ma, Volker Tresp, Alois Knoll • 2025

Related benchmarks

Task                         Dataset                       Result                   Rank
Open-loop planning           nuScenes (val)                L2 Error (3s) 0.55       225
Autonomous Driving Planning  NAVSIM v1                     NC 92.2                  126
Open-loop planning           nuScenes                      L2 Error (Avg) 0.33      121
Planning                     nuScenes (val)                Collision Rate (Avg) 25  97
Open-loop planning           nuScenes v1.0 (val)           L2 (1s) 0.14             71
Trajectory Planning          nuScenes                      --                       58
Open-loop planning           nuScenes v1.0-trainval (val)  L2 Error (Avg) 0.33      54
Visual Question Answering    NuscenesQA                    Accuracy 58.2            33
Visual Question Answering    nuScenes-QA                   Overall Score 58.4       21
3D Question Answering        NuscenesQA v1.0 (test)        --                       19
Showing 10 of 17 rows
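
For readers unfamiliar with the L2 metrics listed above, the following is a minimal sketch of how an open-loop L2 error is commonly computed from a predicted and a ground-truth ego trajectory. It assumes waypoints sampled at 2 Hz (0.5 s apart) as (x, y) positions in meters. Two conventions appear in the nuScenes planning literature: the error at the horizon timestep itself, and the error averaged over all waypoints up to that horizon. The listing does not say which one underlies the reported numbers, so both are shown. The function names and the toy trajectories are hypothetical.

```python
import math


def l2_errors(pred, gt):
    """Per-waypoint Euclidean distance between predicted and true (x, y)."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def l2_at_horizon(pred, gt, horizon_s, dt=0.5):
    """Error at the single waypoint that falls on the horizon (e.g. 3 s)."""
    idx = round(horizon_s / dt) - 1
    return l2_errors(pred, gt)[idx]


def l2_avg_up_to(pred, gt, horizon_s, dt=0.5):
    """Mean error over all waypoints from the first one up to the horizon."""
    n = round(horizon_s / dt)
    errs = l2_errors(pred, gt)[:n]
    return sum(errs) / len(errs)


# Toy data: the ego drives straight ahead; the prediction drifts sideways
# by an extra 0.1 m at each of the six waypoints covering 3 seconds.
gt = [(0.0, 2.0 * i) for i in range(1, 7)]
pred = [(0.1 * i, 2.0 * i) for i in range(1, 7)]

at_3s = l2_at_horizon(pred, gt, 3.0)
avg_3s = l2_avg_up_to(pred, gt, 3.0)
print(round(at_3s, 2), round(avg_3s, 2))
```

The two conventions can differ noticeably on the same prediction (here the averaged value is well below the value at the 3 s waypoint), which is worth keeping in mind when comparing L2 numbers across leaderboards.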
