OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
About
Advances in vision-language models (VLMs) have led to growing interest in leveraging their strong reasoning capabilities for autonomous driving. However, extending these capabilities from 2D to full 3D understanding is crucial for real-world applications. To address this challenge, we propose OmniDrive, a holistic vision-language dataset that aligns agent models with 3D driving tasks through counterfactual reasoning. This approach enhances decision-making by evaluating potential scenarios and their outcomes, similar to human drivers considering alternative actions. Our counterfactual-based synthetic data annotation process generates large-scale, high-quality datasets, providing denser supervision signals that bridge planning trajectories and language-based reasoning. Further, we explore two advanced OmniDrive-Agent frameworks, namely Omni-L and Omni-Q, to assess the importance of vision-language alignment versus 3D perception, revealing critical insights into designing effective LLM-agents. Significant improvements on the DriveLM Q&A benchmark and nuScenes open-loop planning demonstrate the effectiveness of our dataset and methods.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open-loop planning | nuScenes v1.0 (val) | L2 (1s) | 0.14 | 59 |
| Planning | nuScenes (val) | Collision Rate (Avg) | 30 | 52 |
| Planning | nuScenes v1.0-trainval (val) | ST-P3 L2 Error (1s) | 0.14 | 39 |
| Open-loop trajectory prediction | nuScenes v1.0 (test) | L2 Error (1s) | 0.14 | 29 |
| Open-loop planning | nuScenes v1.0 (test) | L2 Error (1s) | 0.14 | 28 |
| Open-loop planning | nuScenes | L2 Error (1s) | 0.14 | 20 |
| 3D Visual Question Answering | nuScenes VQA | Accuracy | 0.592 | 14 |
| Motion Planning | nuScenes | ST-P3 Collision (1s) | 0.04 | 11 |
| Image Captioning | OmniDrive | CIDEr | 68.6 | 9 |
| Scene Understanding | OmniDrive (test) | ROUGE-L | 0.326 | 8 |
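
The L2 numbers above follow the usual nuScenes open-loop planning protocol: the average Euclidean distance between predicted and ground-truth BEV waypoints up to a given horizon. A minimal sketch of that metric, assuming 2 Hz waypoints (dt = 0.5 s) and a hypothetical helper name (`l2_error_at_horizon` is not part of any official devkit):

```python
import numpy as np

def l2_error_at_horizon(pred, gt, horizon_s, dt=0.5):
    """Average L2 distance (meters) between predicted and ground-truth
    BEV waypoints up to `horizon_s` seconds.

    pred, gt: arrays of shape (T, 2) holding (x, y) waypoints sampled
    every `dt` seconds. nuScenes planning is commonly evaluated at
    2 Hz (dt = 0.5 s) over 1s/2s/3s horizons.
    """
    n = int(horizon_s / dt)               # waypoints within the horizon
    diff = pred[:n] - gt[:n]              # (n, 2) per-step x/y offsets
    return float(np.linalg.norm(diff, axis=1).mean())

# Example: every predicted waypoint is offset by a 3-4-5 triangle,
# so the per-step L2 distance is 5 m at both 1s steps.
pred = np.array([[3.0, 4.0], [3.0, 4.0]])
gt = np.zeros((2, 2))
print(l2_error_at_horizon(pred, gt, horizon_s=1.0))  # → 5.0
```

Reported collision rates are computed analogously, checking whether each predicted waypoint overlaps any ground-truth obstacle box within the horizon.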