Geometry Forcing: Marrying Video Diffusion and 3D Representation for Consistent World Modeling

About

Videos inherently represent 2D projections of a dynamic 3D world. However, our analysis suggests that video diffusion models trained solely on raw video data often fail to capture meaningful geometry-aware structure in their learned representations. To bridge the gap between video diffusion models and the underlying 3D nature of the physical world, we propose Geometry Forcing, a simple yet effective method that encourages video diffusion models to internalize 3D representations. Our key insight is to guide the model's intermediate representations toward geometry-aware structure by aligning them with features from a geometric foundation model. To this end, we introduce two complementary alignment objectives: Angular Alignment, which enforces directional consistency via cosine similarity, and Scale Alignment, which preserves scale-related information by regressing geometric features from normalized diffusion representations. We evaluate Geometry Forcing on both camera-view-conditioned and action-conditioned video generation tasks. Experimental results demonstrate that our method substantially improves visual quality and 3D consistency over the baseline methods. Project page: https://GeometryForcing.github.io.
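
The two alignment objectives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear regression head (`weight`, `bias`), and the assumption that diffusion and geometric features already share a dimension for the angular term are all hypothetical choices made for illustration, using only the Python standard library.

```python
import math


def l2_normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / (norm + eps) for x in v]


def angular_alignment_loss(diff_feats, geo_feats):
    """Mean of (1 - cosine similarity) over paired feature vectors.

    Assumes (hypothetically) that both feature sets have the same dimension.
    """
    total = 0.0
    for h, g in zip(diff_feats, geo_feats):
        h_hat, g_hat = l2_normalize(h), l2_normalize(g)
        total += 1.0 - sum(a * b for a, b in zip(h_hat, g_hat))
    return total / len(diff_feats)


def scale_alignment_loss(diff_feats, geo_feats, weight, bias):
    """Mean squared error between geometric features and a prediction
    computed from *normalized* diffusion features.

    The linear head (weight, bias) is an illustrative assumption; the
    abstract only says the geometric features are regressed.
    """
    total = 0.0
    for h, g in zip(diff_feats, geo_feats):
        h_hat = l2_normalize(h)
        pred = [
            sum(w * x for w, x in zip(row, h_hat)) + b
            for row, b in zip(weight, bias)
        ]
        total += sum((p - t) ** 2 for p, t in zip(pred, g)) / len(g)
    return total / len(diff_feats)


# Toy data: the first pair differs only in magnitude, the second is orthogonal.
diff_feats = [[3.0, 4.0], [1.0, 0.0]]
geo_feats = [[6.0, 8.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zero_bias = [0.0, 0.0]

angular = angular_alignment_loss(diff_feats, geo_feats)
scale = scale_alignment_loss(diff_feats, geo_feats, identity, zero_bias)
```

In this toy setup the first pair contributes nothing to the angular term because the vectors point the same way, yet it dominates the scale term because the magnitudes differ. That is one way to read why the abstract calls the two objectives complementary: cosine similarity is blind to magnitude, so a separate regression term is needed to retain it.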

Haoyu Wu, Diankun Wu, Tianyu He, Junliang Guo, Yang Ye, Yueqi Duan, Jiang Bian • 2025

Related benchmarks

Task	Dataset	Metric	Result
Single-image world generation	WorldScore Indoor	3D Consistency	78.05	7
Single-image world generation	DL3DV	3D Consistency	67.12	7
1-view-based novel view generation	RealEstate10K	PSNR	15.97	7
1-view-based novel view generation	DL3DV-10K	PSNR	11.62	7
Camera-view-conditioned video generation	RealEstate10K long-term (256-Frame)	FVD	237	6
Video Generation	RealEstate10K 0~64 frames (test)	PSNR	16.37	6
Video Generation	RealEstate10K 0~128 frames (test)	PSNR	12.69	6
Video Generation	RealEstate10K 0~200 frames (test)	PSNR	10.59	6
Video Generation	RealEstate10K >=256 frames (test)	PSNR	9.91	6
Camera-view-conditioned video generation	RealEstate10K short-term (16-Frame)	FVD	179	5

