# SOLVE: Synergy of Language-Vision and End-to-End Networks for Autonomous Driving

## About
The integration of Vision-Language Models (VLMs) into autonomous driving systems has shown promise in addressing key challenges such as learning complexity, interpretability, and common-sense reasoning. However, existing approaches often struggle with efficient integration and real-time decision-making due to computational demands. In this paper, we introduce SOLVE, an innovative framework that synergizes VLMs with end-to-end (E2E) models to enhance autonomous vehicle planning. Our approach emphasizes knowledge sharing at the feature level through a shared visual encoder, enabling comprehensive interaction between the VLM and E2E components. We propose a Trajectory Chain-of-Thought (T-CoT) paradigm, which progressively refines trajectory predictions, reducing uncertainty and improving accuracy. By employing a temporal decoupling strategy, SOLVE achieves efficient cooperation, aligning high-quality VLM outputs with the real-time performance of the E2E model. Evaluated on the nuScenes dataset, our method demonstrates significant improvements in trajectory prediction accuracy, paving the way for more robust and reliable autonomous driving systems.
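The core intuition behind T-CoT — start from a coarse trajectory and refine it in successive stages — can be sketched with a toy example. The refinement rule, horizon, and waypoint layout below are illustrative assumptions for exposition only, not the paper's actual method:

```python
import numpy as np

# Hypothetical sketch of coarse-to-fine trajectory refinement (T-CoT intuition):
# begin with a coarse straight-line guess, then apply chained refinement stages
# that move each waypoint part-way toward a reference path. The linear update
# rule and all names here are assumptions, not SOLVE's real architecture.

def refine(trajectory, reference, step=0.5):
    """One refinement stage: nudge each waypoint toward the reference path."""
    return trajectory + step * (reference - trajectory)

# Coarse initial guess: straight line over a 3 s horizon (6 waypoints at 2 Hz).
coarse = np.linspace([0.0, 0.0], [10.0, 0.0], 6)

# Curved reference path the refinement stages should approach.
xs = np.linspace(0.0, 10.0, 6)
reference = np.stack([xs, xs**2 / 20.0], axis=1)

traj = coarse
for _ in range(3):  # three chained refinement stages
    traj = refine(traj, reference)

# Mean L2 error to the reference shrinks with each stage
# (each stage halves the remaining gap under this toy update rule).
err = np.linalg.norm(traj - reference, axis=1).mean()
```

With a step of 0.5, three stages leave 0.5³ = 12.5 % of the initial waypoint gap, mirroring how each chain-of-thought step reduces residual trajectory uncertainty.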
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Open-loop planning | nuScenes v1.0 (val) | L2 error (1 s) | 0.13 | 59 |
| Planning | nuScenes v1.0-trainval (val) | ST-P3 L2 error (1 s) | 0.13 | 39 |
| Open-loop planning | nuScenes v1.0 (test) | L2 error (1 s) | 0.13 | 28 |
| Open-loop planning | nuScenes | L2 error (1 s) | 0.13 | 20 |