Think Before You Diffuse: Infusing Physical Rules into Video Diffusion

About

Recent video diffusion models have demonstrated great capability in generating visually pleasing results, while synthesizing the correct physical effects in generated videos remains challenging. The complexity of real-world motions, interactions, and dynamics introduces great difficulties when learning physics from data. In this work, we propose DiffPhy, a generic framework that enables physically-correct and photo-realistic video generation by fine-tuning a pre-trained video diffusion model. Our method leverages large language models (LLMs) to infer rich physical context from the text prompt. To incorporate this context into the video diffusion model, we use a multimodal large language model (MLLM) to verify intermediate latent variables against the inferred physical rules, guiding the model's gradient updates accordingly. The LLM's textual output is transformed into continuous signals. We then formulate a set of training objectives that jointly ensure physical accuracy and semantic alignment with the input text. Additionally, failure cases of physical phenomena are corrected via attention injection. We also establish a high-quality physical video dataset containing diverse physical actions and events to facilitate effective fine-tuning. Extensive experiments on public benchmarks demonstrate that DiffPhy is able to produce state-of-the-art results across diverse physics-related scenarios. Our project page is available at https://bwgzk-keke.github.io/DiffPhy/.
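The abstract states that the LLM's textual output is transformed into continuous signals that guide the model's gradient updates. The paper's exact mechanism is not given here, so below is a minimal, hypothetical sketch of one way such a transformation could look: per-rule textual verdicts from a verifier are mapped to scalars, then aggregated into a differentiable-style penalty. All function and variable names are illustrative, not from the paper.

```python
# Hypothetical sketch: map an MLLM's textual rule-compliance verdict to a
# continuous scalar, then aggregate per-rule scalars into a guidance penalty.
# This is NOT the paper's implementation; it only illustrates the idea of
# turning text into a continuous training signal.

def verdict_to_signal(verdict: str, confidence: float) -> float:
    """Map a textual verdict to a scalar in [-1, 1].

    Returns +confidence if the physical rule is judged satisfied,
    -confidence if violated, and 0.0 for unparsable/uncertain verdicts.
    """
    verdict = verdict.strip().lower()
    if verdict == "satisfied":
        return confidence
    if verdict == "violated":
        return -confidence
    return 0.0


def physics_guidance_penalty(signals: list[float]) -> float:
    """Aggregate per-rule signals into a penalty (higher = more violations).

    Satisfied rules contribute nothing; violated rules contribute their
    (negated) signal magnitude, averaged over all checked rules.
    """
    if not signals:
        return 0.0
    return sum(max(0.0, -s) for s in signals) / len(signals)
```

In a guided fine-tuning loop, a penalty like this could be added (with some weight) to the standard diffusion objective so that latents judged to violate the inferred rules receive larger updates.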

Ke Zhang, Cihan Xiao, Jiacong Xu, Yiqun Mei, Vishal M. Patel• 2025

Related benchmarks

Task                            | Dataset     | Result            | Rank
Video Generation                | PhyGenBench | PCA Score: 0.54   | 13
Physics-aware Video Generation  | PhyGenBench | Mechanics PD: 0.73| 4
