
Pyramidal Flow Matching for Efficient Video Generative Modeling

About

Video generation requires modeling a vast spatiotemporal space, which demands significant computational resources and data usage. To reduce the complexity, the prevailing approaches employ a cascaded architecture to avoid direct training with full resolution latent. Despite reducing computational demands, the separate optimization of each sub-stage hinders knowledge sharing and sacrifices flexibility. This work introduces a unified pyramidal flow matching algorithm. It reinterprets the original denoising trajectory as a series of pyramid stages, where only the final stage operates at the full resolution, thereby enabling more efficient video generative modeling. Through our sophisticated design, the flows of different pyramid stages can be interlinked to maintain continuity. Moreover, we craft autoregressive video generation with a temporal pyramid to compress the full-resolution history. The entire framework can be optimized in an end-to-end manner and with a single unified Diffusion Transformer (DiT). Extensive experiments demonstrate that our method supports generating high-quality 5-second (up to 10-second) videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours. All code and models are open-sourced at https://pyramid-flow.github.io.
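To make the efficiency argument concrete, the following is a minimal back-of-the-envelope sketch, not the authors' implementation. It assumes three pyramid stages, a 2x spatial downsampling between adjacent stages, and an equal share of denoising steps per stage. None of these values is stated in the abstract, and the function name `stage_token_fractions` is hypothetical. The sketch only counts spatial latent tokens per stage relative to full resolution.

```python
def stage_token_fractions(num_stages: int, downsample_per_stage: int = 2) -> list[float]:
    """Fraction of full-resolution spatial tokens processed at each pyramid stage.

    Stage 0 is the coarsest; the last stage runs at full resolution.
    The stage count and the per-stage factor are illustrative assumptions.
    """
    fractions = []
    for k in range(num_stages):
        # Per-axis downsampling factor of stage k relative to full resolution.
        scale = downsample_per_stage ** (num_stages - 1 - k)
        # Height and width both shrink, so the token count shrinks by scale squared.
        fractions.append(1.0 / (scale * scale))
    return fractions


fractions = stage_token_fractions(3)
# Assumption: every stage receives the same number of denoising steps.
average = sum(fractions) / len(fractions)

print(fractions)  # [0.0625, 0.25, 1.0]
print(average)    # 0.4375
```

Under these assumptions the average step processes fewer than half the tokens of a full-resolution step. Because self-attention cost grows faster than linearly in token count, the compute saving in a DiT could be larger than this token ratio suggests.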

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, Zhouchen Lin • 2024

Related benchmarks

Task | Dataset | Result | Rank
Class-conditional Image Generation | ImageNet 256x256 | -- | 815
Text-to-Video Generation | VBench | Quality Score 84.74 | 155
Video Generation | VBench | Quality Score 84.74 | 126
Video Generation | VBench 5s | Total Score 82.66 | 58
Video Generation | VBench (test) | Semantic Score 69.62 | 48
Video Generation | VBench 2.0 (test) | Total Score 81.72 | 44
Video Generation | short videos 81-frames 240 prompts | Total Score 4.55 | 38
Unconditional Image Generation | CelebA-HQ 256x256 | Fréchet Distance (FD) 11.2 | 27
Long Video Generation | 120, 240, 720 and 1440-frames long videos | Total Score 2.85 | 20
Video Generation | VBench short video (test) | Subject Consistency 69.62 | 16
Showing 10 of 31 rows
