Seer: Language Instructed Video Prediction with Latent Diffusion Models

About

Imagining the future trajectory is key for robots to plan soundly and reach their goals. Text-conditioned video prediction (TVP) is therefore an essential task for general robot policy learning. To tackle this task and give robots the ability to foresee the future, we propose a sample- and computation-efficient model, named Seer, built by inflating pretrained text-to-image (T2I) Stable Diffusion models along the temporal axis. We enhance the U-Net and the language conditioning model by incorporating computation-efficient spatial-temporal attention. Furthermore, we introduce a novel Frame Sequential Text Decomposer module that dissects a sentence's global instruction into temporally aligned sub-instructions, ensuring precise integration into each generated frame. Our framework lets us effectively leverage the extensive prior knowledge embedded in pretrained T2I models across frames. With this adaptable architecture, Seer can generate high-fidelity, coherent, and instruction-aligned video frames by fine-tuning only a few layers on a small amount of data. Experimental results on the Something-Something V2 (SSv2), BridgeData, and EpicKitchens-100 datasets demonstrate superior video prediction performance at around 480 GPU-hours, versus over 12,480 GPU-hours for CogVideo: Seer achieves a 31% FVD improvement over the current SOTA model on SSv2 and an 83.7% average preference in human evaluation.

Xianfan Gu, Chuan Wen, Weirui Ye, Jiaming Song, Yang Gao • 2023
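
The abstract describes the spatial-temporal attention as computation-efficient but gives no mechanics. As a rough illustration only, the sketch below shows one generic way such savings can arise: attending over the tokens within each frame, then over the frames at each token position, instead of over all frame-token pairs at once. This is not the authors' implementation. The function names (`attention`, `factorized_st_attention`, `pair_counts`), the identity query/key/value projections, and the single-head layout are all assumptions made for brevity.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x):
    # Single-head self-attention over axis 1 of x with shape (batch, tokens, dim).
    # Assumption: identity Q/K/V projections, to keep the sketch short.
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)  # (batch, tokens, tokens)
    return softmax(scores, axis=-1) @ x             # (batch, tokens, dim)


def factorized_st_attention(video):
    # video: (T, N, D) = frames, spatial tokens per frame, channels.
    # Spatial pass: frames act as the batch, tokens attend within a frame.
    spatial = attention(video)
    # Temporal pass: spatial positions act as the batch, frames attend to frames.
    temporal = attention(spatial.transpose(1, 0, 2))  # (N, T, D)
    return temporal.transpose(1, 0, 2)                # back to (T, N, D)


def pair_counts(T, N):
    # Number of query-key pairs scored by joint vs. factorized attention.
    joint = (T * N) ** 2
    factorized = T * N ** 2 + N * T ** 2
    return joint, factorized


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = rng.normal(size=(4, 6, 8))  # hypothetical toy sizes
    out = factorized_st_attention(video)
    print(out.shape)
    print(pair_counts(4, 6))
```

For the toy sizes above (4 frames, 6 tokens per frame), joint attention scores 576 pairs while the factorized version scores 240, and the gap widens as the number of frames and tokens grows.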

Related benchmarks

Task	Dataset	Result
Video Prediction	UCF-101 (test)	FVD 260.7	19
Class-conditioned Image-to-Video Generation	Something-Something v2	FVD 112.9	9
Text-conditioned Video Prediction	Something-Something v2	FVD 112.9	8
Text-conditioned Video Prediction	BridgeData	FVD 246.3	8
Text-conditioned Video Prediction	Epic Kitchens 100	FVD 271.4	8
Class-conditioned Image-to-Video Generation	Epic Kitchens 100	FVD 271.4	8
Language-driven motion control in Text-to-Video generation	SSv2 (val)	FVD 287.5	8
Video Generation	Something-Something v2 (test val)	FID 33.35	6
Video Prediction	Bridge (val)	FVD 246.3	4
button-press-topdown-v2	Meta-World (in-domain)	Success Rate 45	2

Showing 10 of 13 rows
