DriveX: Omni Scene Modeling for Learning Generalizable World Knowledge in Autonomous Driving

About

Data-driven learning has advanced autonomous driving, yet task-specific models struggle with out-of-distribution scenarios due to their narrow optimization objectives and reliance on costly annotated data. We present DriveX, a self-supervised world model that learns generalizable scene dynamics and holistic representations (geometric, semantic, and motion) from large-scale driving videos. DriveX introduces Omni Scene Modeling (OSM), a module that unifies multimodal supervision (3D point cloud forecasting, 2D semantic representation, and image generation) to capture comprehensive scene evolution. To simplify learning complex dynamics, we propose a decoupled latent world modeling strategy that separates world representation learning from future state decoding, augmented by dynamic-aware ray sampling to enhance motion modeling. For downstream adaptation, we design Future Spatial Attention (FSA), a unified paradigm that dynamically aggregates spatiotemporal features from DriveX's predictions to enhance task-specific inference. Extensive experiments demonstrate DriveX's effectiveness: it achieves significant improvements in 3D future point cloud prediction over prior work, while attaining state-of-the-art results on diverse tasks including occupancy prediction, flow estimation, and end-to-end driving. These results validate DriveX's capability as a general-purpose world model, paving the way for robust and unified autonomous driving frameworks.

Chen Shi, Shaoshuai Shi, Kehua Sheng, Bo Zhang, Li Jiang • 2025

Related benchmarks

Task	Dataset	Result
Autonomous Driving	NAVSIM v1 (test)	NC 97.5	147
Autonomous Driving Planning	NAVSIM v1	NC 97.5	126
Autonomous Driving	NAVSIM (test)	PDMS 84.5	62
End-to-end Planning	NAVSIM	NC 97.5	13

