Impromptu VLA: Open Weights and Open Data for Driving Vision-Language-Action Models
About
Vision-Language-Action (VLA) models for autonomous driving show promise but falter in unstructured corner-case scenarios, largely due to a scarcity of targeted benchmarks. To address this, we introduce Impromptu VLA. Our core contribution is the Impromptu VLA Dataset: over 80,000 meticulously curated video clips, distilled from over 2M source clips drawn from 8 open-source large-scale datasets. This dataset is built upon our novel taxonomy of four challenging unstructured categories and features rich, planning-oriented question-answering annotations and action trajectories. Crucially, experiments demonstrate that VLAs trained with our dataset achieve substantial performance gains on established benchmarks: they improve closed-loop NeuroNCAP scores and collision rates, and reach near state-of-the-art L2 accuracy in open-loop nuScenes trajectory prediction. Furthermore, our Q&A suite serves as an effective diagnostic, revealing clear VLM improvements in perception, prediction, and planning. Our code, data, and models are available at https://github.com/ahydchh/Impromptu-VLA.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Open-loop planning | nuScenes | L2 Error (Avg): 0.3 | 103 |
| Open-loop trajectory prediction | nuScenes v1.0 (test) | L2 Error (1s): 0.13 | 29 |
| Closed-loop simulation | NeuroNCAP | NeuroNCAP Score (Avg): 2.15 | 21 |
| Open-loop trajectory prediction | nuScenes | L2 Error (m): 0.33 | 14 |
| Traffic Question Answering | Proposed Driving Benchmark | T-QA Score: 46.3 | 10 |
| Noticeable Object Perception and Reasoning | Proposed Driving Benchmark | NoPR: 32.2 | 10 |
| Scene Description | Proposed Driving Benchmark | SD Score: 60.8 | 10 |
| Knowledge Retention | Proposed Driving Benchmark | KRR: 68.4 | 9 |
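The L2 error reported in the table above is the Euclidean distance between predicted and ground-truth ego waypoints in the bird's-eye-view plane, evaluated at fixed horizons (typically 1s, 2s, 3s) and optionally averaged. A minimal sketch of this metric, assuming waypoints are given as `(x, y)` pairs in metres (the variable names and example values are illustrative, not from the benchmark):

```python
import numpy as np

def l2_error(pred, gt):
    """Per-horizon L2 (Euclidean) error between predicted and ground-truth
    BEV waypoints. Both inputs have shape (T, 2); returns an array of length T."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    return np.linalg.norm(pred - gt, axis=-1)

# Hypothetical waypoints at 1s / 2s / 3s horizons (x forward, y left), in metres.
pred = [[4.9, 0.1], [10.2, 0.3], [15.8, 0.6]]
gt = [[5.0, 0.0], [10.0, 0.5], [16.0, 0.5]]

per_horizon = l2_error(pred, gt)  # errors at 1s, 2s, 3s
avg = per_horizon.mean()          # aggregate in the style of "L2 Error (Avg)"
```

Benchmark protocols differ in how they aggregate (per-horizon vs. averaged over the whole trajectory), which is why the table lists both "L2 Error (1s)" and "L2 Error (Avg)" entries.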