
LLM-grounded Video Diffusion Models

About

Text-conditioned diffusion models have emerged as a promising tool for neural video generation. However, current models still struggle with intricate spatiotemporal prompts and often generate restricted or incorrect motion. To address these limitations, we introduce LLM-grounded Video Diffusion (LVD). Instead of directly generating videos from the text inputs, LVD first leverages a large language model (LLM) to generate dynamic scene layouts based on the text inputs and subsequently uses the generated layouts to guide a diffusion model for video generation. We show that LLMs are able to understand complex spatiotemporal dynamics from text alone and generate layouts that align closely with both the prompts and the object motion patterns typically observed in the real world. We then propose to guide video diffusion models with these layouts by adjusting the attention maps. Our approach is training-free and can be integrated into any video diffusion model that admits classifier guidance. Our results demonstrate that LVD significantly outperforms its base video diffusion model and several strong baseline methods in faithfully generating videos with the desired attributes and motion patterns.
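
To make the second stage concrete, here is a minimal, runnable PyTorch sketch of the kind of classifier-guidance-style update the abstract describes. This is not the authors' implementation: it assumes the layouts are per-frame bounding boxes, and `box_mask`, `attention_energy`, the toy attention map, and the step size are all illustrative stand-ins. What it demonstrates is the core idea of defining an energy that is low when the object token's cross-attention mass falls inside its box in each frame, then descending that energy with respect to the latent.

```python
# Hedged sketch of layout-guided denoising, not the authors' code.
# Assumes layouts are normalized per-frame boxes and a cross-attention
# map of shape (frames, H*W) for a single object token.
import torch

def box_mask(box, h, w):
    """Rasterize a normalized (x0, y0, x1, y1) box into an (h, w) binary mask."""
    mask = torch.zeros(h, w)
    x0, y0, x1, y1 = box
    mask[int(y0 * h):int(y1 * h), int(x0 * w):int(x1 * w)] = 1.0
    return mask

def attention_energy(attn, boxes, h, w):
    """Energy that is low when each frame's attention mass for the object
    token falls inside that frame's box (one box per frame)."""
    energy = 0.0
    for t, box in enumerate(boxes):
        mask = box_mask(box, h, w).flatten()        # (h*w,)
        a = attn[t] / (attn[t].sum() + 1e-8)        # normalize per frame
        energy = energy + (1.0 - (a * mask).sum())  # attention mass outside the box
    return energy

# Toy example: 4 frames on a 16x16 latent grid, object moving left to right.
h = w = 16
frames = 4
boxes = [(0.1 + 0.2 * t, 0.4, 0.3 + 0.2 * t, 0.6) for t in range(frames)]

# Stand-in attention map; in a real model this comes from the denoiser's
# cross-attention layers and depends on the latent being optimized.
latent = torch.randn(frames, h * w, requires_grad=True)
attn = latent.softmax(dim=-1)

# One classifier-guidance-style step: descend the energy w.r.t. the latent.
e = attention_energy(attn, boxes, h, w)
e.backward()
with torch.no_grad():
    latent -= 1.0 * latent.grad  # step size is illustrative
print(f"energy before step: {e.item():.3f}")
```

In the actual pipeline this gradient step would be interleaved with the usual denoising steps of the video diffusion model, which is why the abstract notes the method is training-free and applies to any model that admits classifier guidance.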

Long Lian, Baifeng Shi, Adam Yala, Trevor Darrell, Boyi Li • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-to-Video Generation | T2V-CompBench | Consistency Attribute Score | 0.5595 | 22
Text-to-Video Generation | TC-Bench | Attribute Transition TCR | 5.77 | 8
Text-to-Video Generation | Compositional Prompts | VBLIP-VQA Score | 0.482 | 7
Spatio-temporal control | DAVIS 16 (test) | mIoU | 26.1 | 5
Spatio-temporal control | LaSOT (test) | mIoU | 13.5 | 5
Spatio-temporal control | IMC | mIoU | 36.1 | 5
Spatio-temporal control | ssv2 ST | mIoU | 27.2 | 5
Layout-guided video generation | YouTubeVIS 2021 (test val) | FVD | 558 | 5
