OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model
About
We present OpenDriveVLA, a Vision-Language-Action model designed for end-to-end autonomous driving, built upon open-source large language models. OpenDriveVLA generates spatially grounded driving actions from multimodal inputs: 2D and 3D instance-aware visual representations, ego vehicle states, and language commands. To bridge the modality gap between driving visual representations and language embeddings, we introduce a hierarchical vision-language alignment process that projects both 2D and 3D structured visual tokens into a unified semantic space. Furthermore, we incorporate structured agent-environment-ego interaction modeling into the autoregressive decoding process, enabling the model to capture the fine-grained spatial dependencies and behavior-aware dynamics critical for reliable trajectory planning. Extensive experiments on the nuScenes dataset demonstrate that OpenDriveVLA achieves state-of-the-art results on both open-loop trajectory planning and driving-related question answering tasks. Qualitative analyses further illustrate its capability to follow high-level driving commands and generate trajectories under challenging scenarios, highlighting its potential for next-generation end-to-end autonomous driving.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open-loop planning | nuScenes | L2 Error (Avg) | 0.33 | 103 |
| Planning | nuScenes (val) | Collision Rate (Avg) | 25 | 80 |
| Open-loop planning | nuScenes v1.0 (val) | L2 (1s) | 0.14 | 71 |
| Trajectory Planning | nuScenes | ST-P3 L2 Error (1s) | 0.14 | 49 |
| 3D Question Answering | NuscenesQA v1.0 (test) | -- | -- | 19 |
| Motion Planning | nuScenes | L2 Error (1s) | 0.14 | 15 |
| Open-loop trajectory prediction | nuScenes | L2 Error (m) | 0.33 | 14 |
| Open-loop Evaluation | nuScenes | L2 Average Error (1s, m) | 0.15 | 10 |
| Trajectory Generation | CARLA FPV 1000 samples (test) | Score | 0.13 | 10 |
| Text Understanding | nuScenes-QA 1.0 (val) | Existence Accuracy | 84.2 | 8 |
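
The L2 entries in the table are displacement errors, in meters, between planned and ground-truth ego waypoints. The following is a minimal sketch of how such a number could be computed, not the evaluation code behind these results. It assumes waypoints sampled at 2 Hz (the nuScenes keyframe rate), a 3-second planning horizon, and the convention of reading the error at each horizon rather than averaging all waypoints up to it; evaluation protocols differ on that last point. The function names `l2_errors`, `l2_at_horizon`, and `average_l2` are hypothetical.

```python
import math


def l2_errors(pred, gt):
    """Per-waypoint Euclidean distance (m) between predicted and ground-truth (x, y) points."""
    if len(pred) != len(gt):
        raise ValueError("pred and gt must contain the same number of waypoints")
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]


def l2_at_horizon(pred, gt, horizon_s, hz=2.0):
    """L2 error at the waypoint `horizon_s` seconds ahead.

    Assumption: the first waypoint lies 1/hz seconds in the future.
    """
    idx = int(round(horizon_s * hz)) - 1
    errors = l2_errors(pred, gt)
    if not 0 <= idx < len(errors):
        raise ValueError("horizon lies outside the trajectory")
    return errors[idx]


def average_l2(pred, gt, horizons=(1.0, 2.0, 3.0), hz=2.0):
    """Mean of the per-horizon L2 errors (assumed meaning of an 'Avg' L2 entry)."""
    return sum(l2_at_horizon(pred, gt, h, hz) for h in horizons) / len(horizons)


# Toy trajectories: six waypoints at 2 Hz, i.e. a 3-second plan.
pred = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0), (0.0, 5.0), (0.0, 6.0)]
gt = [(0.0, 1.0), (0.3, 2.0), (0.0, 3.0), (0.0, 4.4), (0.0, 5.0), (0.5, 6.0)]

l2_1s = l2_at_horizon(pred, gt, 1.0)
l2_avg = average_l2(pred, gt)
```

Under these assumptions, `l2_1s` corresponds to an "L2 (1s)" style entry and `l2_avg` to an "L2 Error (Avg)" style entry. Whether a given leaderboard row uses this convention or the averaged-up-to-horizon variant has to be checked against that benchmark's protocol.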