VideoREPA: Learning Physics for Video Generation through Relational Alignment with Foundation Models

About

Recent advancements in text-to-video (T2V) diffusion models have enabled high-fidelity and realistic video synthesis. However, current T2V models often struggle to generate physically plausible content due to their limited inherent ability to accurately understand physics. We found that while the representations within T2V models possess some capacity for physics understanding, they lag significantly behind those from recent video self-supervised learning methods. To this end, we propose a novel framework called VideoREPA, which distills physics understanding capability from video understanding foundation models into T2V models by aligning token-level relations. This closes the physics understanding gap and enables more physically plausible generation. Specifically, we introduce the Token Relation Distillation (TRD) loss, which leverages spatio-temporal alignment to provide soft guidance suitable for finetuning powerful pre-trained T2V models, a critical departure from prior representation alignment (REPA) methods. To our knowledge, VideoREPA is the first REPA method designed for finetuning T2V models and specifically for injecting physical knowledge. Empirical evaluations show that VideoREPA substantially enhances the physics commonsense of the baseline method, CogVideoX, achieving significant improvements on relevant benchmarks and demonstrating a strong capacity for generating videos consistent with intuitive physics. More video results are available at https://videorepa.github.io/.
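The abstract does not give the exact form of the TRD loss, so the following is only a minimal sketch of what aligning token-level relations could look like, not the authors' implementation. It assumes that relations are pairwise cosine similarities between tokens and that the two relation matrices are compared with a mean absolute difference; both choices, and the function names `cosine_similarity_matrix` and `token_relation_loss`, are illustrative assumptions.

```python
import math


def cosine_similarity_matrix(tokens):
    """Pairwise cosine similarity between all tokens (a list of feature vectors)."""
    normed = []
    for vec in tokens:
        # Guard against zero vectors so the division below is always defined.
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


def token_relation_loss(student_tokens, teacher_tokens):
    """Mean absolute difference between student and teacher relation matrices.

    Assumption: cosine similarity and an L1-style distance stand in for
    whatever relation and distance the paper actually uses.
    """
    if len(student_tokens) != len(teacher_tokens):
        raise ValueError("student and teacher must have the same number of tokens")
    s = cosine_similarity_matrix(student_tokens)
    t = cosine_similarity_matrix(teacher_tokens)
    n = len(s)
    return sum(abs(s[i][j] - t[i][j]) for i in range(n) for j in range(n)) / (n * n)


# Toy example: three tokens, e.g. flattened over frames and spatial positions.
teacher = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
student_aligned = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [1.0, 1.0, 0.0]]
student_collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

loss_aligned = token_relation_loss(student_aligned, teacher)
loss_collapsed = token_relation_loss(student_collapsed, teacher)
```

Because only the token-to-token similarity structure is compared, the student and teacher features may have different dimensions and scales, which is one plausible reason a relational objective acts as softer guidance than matching features directly. If the tokens are flattened across both frames and spatial positions, the same relation matrix covers spatial and temporal pairs.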

Xiangdong Zhang, Jiaqi Liao, Shaofeng Zhang, Fanqing Meng, Xiangpeng Wan, Junchi Yan, Yu Cheng • 2025

Related benchmarks

Task	Dataset	Metric	Result	--
Video Generation	VBench	Quality Score	77	126
Video Generation	VideoPhy	SA (%)	72	50
Video Generation	VBench 2.0 (test)	Total Score	55.24	49
Text-to-Video Generation	VideoPhy	PC Score	40.1	41
Text-to-Video Generation	VideoPhy2 HARD	PC Score	52.2	28
Text-to-Video Generation	VideoPhy 2	SA Score	21.02	22
Video Generation	VBench 1.0 (test)	--	--	21
Physical Plausibility Evaluation	VideoPhy Hard 2	PC Score	86.1	20
Text-to-Video Generation	VideoPhy2 (ALL)	PC Score	72.54	16
Physical Plausibility Evaluation	VideoPhy	Average PC	40.1	16

Showing 10 of 35 rows
