UniUGP: Unifying Understanding, Generation, and Planning For End-to-end Autonomous Driving

About

Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.

Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Open-loop planning	nuScenes v1.0 (test)	L2 Error (1s)	0.58	50
Frame prediction	nuScenes	FID	7.4	16
Motion Planning	nuScenes v1.0 (test)	L2 Error (1s)	0.58	9
Graph Visual Question Answering	DriveLM GVQA	Accuracy	74	7
Chain-of-Thought Reasoning	Driving Evaluation Benchmark	GPT Score	0.88	5
Scene and Object Comprehension	Driving Evaluation Benchmark	Small Object Accuracy	89.3	5
Short-term Driving Planning	Driving Evaluation Benchmark	L2 Error (3s)	1.45	5
Trajectory following	Driving Evaluation Benchmark	L2 Error (3s Horizon)	1.4	5
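
Several rows above report an L2 error at a fixed horizon. This page does not say how the metric is computed, so the following is only a minimal sketch of two conventions seen in open-loop planning evaluation: the displacement at the horizon waypoint, and the mean displacement over all waypoints up to the horizon. The function names and the 2 Hz waypoint spacing are assumptions made for illustration, not details taken from the UniUGP paper.

```python
import math

def l2_error_at_horizon(pred, gt, horizon_s, hz=2.0):
    """Displacement (metres) between predicted and ground-truth waypoints
    at `horizon_s` seconds ahead. Waypoint spacing of 1/hz s is assumed."""
    idx = int(round(horizon_s * hz)) - 1  # waypoint 0 is 1/hz seconds ahead
    if idx < 0 or idx >= min(len(pred), len(gt)):
        raise ValueError("horizon lies outside the trajectory")
    (px, py), (gx, gy) = pred[idx], gt[idx]
    return math.hypot(px - gx, py - gy)

def average_l2_up_to(pred, gt, horizon_s, hz=2.0):
    """Mean displacement over every waypoint up to and including the horizon."""
    n = int(round(horizon_s * hz))
    total = sum(l2_error_at_horizon(pred, gt, (i + 1) / hz, hz) for i in range(n))
    return total / n

# Hypothetical 2-second trajectories as (x, y) waypoints in metres.
pred = [(0.0, 1.0), (0.0, 2.0), (0.3, 3.0), (0.0, 4.4)]
gt = [(0.0, 1.0), (0.0, 2.0), (0.0, 3.0), (0.0, 4.0)]
print(l2_error_at_horizon(pred, gt, 2.0))
print(average_l2_up_to(pred, gt, 2.0))
```

The two conventions can give noticeably different numbers for the same trajectories, so it is worth checking which one a benchmark uses before comparing results across papers.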

