
Inference-time Physics Alignment of Video Generative Models with Latent World Models

About

State-of-the-art video generative models produce promising visual content yet often violate basic physics principles, limiting their utility. While some attribute this deficiency to insufficient physics understanding acquired during pre-training, we find that the shortfall in physical plausibility also stems from suboptimal inference strategies. We therefore introduce WMReward and treat improving the physics plausibility of video generation as an inference-time alignment problem. In particular, we leverage the strong physics prior of a latent world model (here, V-JEPA 2) as a reward to search over and steer multiple candidate denoising trajectories, enabling test-time compute scaling for better generation quality. Empirically, our approach substantially improves physics plausibility across image-conditioned, multi-frame-conditioned, and text-conditioned generation settings, validated by human preference studies. Notably, in the ICCV 2025 Perception Test Physics-IQ Challenge, we achieve a final score of 62.64%, winning first place and outperforming the previous state of the art by 7.42%. Our work demonstrates the viability of using latent world models to improve the physics plausibility of video generation, beyond this specific instantiation or parameterization.
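The core idea above, using a world model's reward to select among candidate denoising trajectories at inference time, can be sketched as a best-of-N search. The sketch below is a minimal toy illustration, not the paper's implementation: the sampler and the reward are stand-ins (the real method scores candidates with V-JEPA 2's latent predictions, and `world_model_reward`, `denoise_candidates`, and all shapes here are hypothetical).

```python
import numpy as np

def denoise_candidates(x_cond, n_candidates, n_steps, rng):
    """Toy stand-in for a diffusion sampler: each candidate trajectory starts
    from noise and is progressively pulled toward the conditioning signal."""
    candidates = []
    for _ in range(n_candidates):
        x = rng.standard_normal(x_cond.shape)
        for t in range(n_steps):
            alpha = (t + 1) / n_steps
            # interpolate toward the condition, with a little residual noise
            x = (1 - alpha) * x + alpha * x_cond + 0.01 * rng.standard_normal(x_cond.shape)
        candidates.append(x)
    return candidates

def world_model_reward(video, target):
    """Hypothetical reward: in the paper this would be derived from a latent
    world model's physics prior; here we fake it as negative L2 error against
    a 'physically plausible' target."""
    return -float(np.mean((video - target) ** 2))

def best_of_n(x_cond, target, n_candidates=8, n_steps=10, seed=0):
    """Generate N candidate denoising trajectories and keep the one the
    (stand-in) world-model reward scores highest."""
    rng = np.random.default_rng(seed)
    candidates = denoise_candidates(x_cond, n_candidates, n_steps, rng)
    scores = [world_model_reward(c, target) for c in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Usage: pretend the conditioning latent is also the plausible target.
target = np.zeros((4, 8, 8))  # hypothetical (frames, H, W) latent video
sample, score = best_of_n(target, target)
```

Increasing `n_candidates` is the test-time compute knob: with the same seed, searching over more trajectories can only match or improve the selected reward, which is the scaling behavior the abstract refers to.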

Jianhao Yuan, Xiaofeng Zhang, Felix Friedrich, Nicolas Beltran-Velez, Melissa Hall, Reyhane Askari-Hemmat, Xiaochuang Han, Nicolas Ballas, Michal Drozdzal, Adriana Romero-Soriano • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Generation | Physics-IQ | Phys. IQ Score: 62 | 45 |
| Text-to-Video Generation | VideoPhy | -- | 20 |
| Human Preference Evaluation | PhysicsIQ 1.0 (test) | Physics Plausibility Win Rate: 54.9 | 4 |
| Human Preference Evaluation | VideoPhy 1.0 (test) | Physics Plausibility Win Rate: 59.3 | 4 |
