Hierarchical Patch Diffusion Models for High-Resolution Video Generation
About
Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, which limits scalability and complicates downstream applications. We instead adopt patch diffusion models (PDMs), a paradigm that models the distribution of patches rather than whole inputs. This makes training very efficient and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion, an architectural technique that propagates context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 $256^2$, surpassing recent methods by more than 100%. We then show that it can be rapidly fine-tuned from a base $36 \times 64$ low-resolution generator for high-resolution $64 \times 288 \times 512$ text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture trained on such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm.
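The core idea of deep context fusion, propagating low-scale context into high-scale patches, can be illustrated with a minimal sketch. This is an assumption-laden toy version, not the paper's implementation: the function name, the channel-concatenation fusion, and the nearest-neighbour resampling are all illustrative choices.

```python
import numpy as np

def deep_context_fusion(patch_feats, context_feats, patch_box):
    """Hypothetical sketch: fuse low-scale context into a high-scale patch.

    patch_feats:   (C, h, w) features of one high-resolution patch.
    context_feats: (C, H, W) features of the full low-resolution input.
    patch_box:     (y0, x0, y1, x1) patch location in normalized [0, 1] coords.
    """
    c, h, w = patch_feats.shape
    _, big_h, big_w = context_feats.shape
    y0, x0, y1, x1 = patch_box
    # Resample the context region that spatially covers this patch
    # (nearest-neighbour, for simplicity).
    ys = np.clip((np.linspace(y0, y1, h) * (big_h - 1)).round().astype(int), 0, big_h - 1)
    xs = np.clip((np.linspace(x0, x1, w) * (big_w - 1)).round().astype(int), 0, big_w - 1)
    context_crop = context_feats[:, ys][:, :, xs]  # (C, h, w), aligned to the patch
    # One simple fusion choice: channel-wise concatenation; the fused
    # features then feed the next (higher-resolution) pyramid level.
    return np.concatenate([patch_feats, context_crop], axis=0)  # (2C, h, w)

# Usage: an 8x8 patch covering the top-left quadrant of a 16x16 context map.
fused = deep_context_fusion(
    np.ones((4, 8, 8)), np.zeros((4, 16, 16)), (0.0, 0.0, 0.5, 0.5)
)
```

In the hierarchical setting this step repeats per level, so each patch at a finer scale always sees globally consistent information from the coarser scales.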
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Generation | UCF-101 (test) | -- | 105 |
| Text-to-Video Generation | UCF-101 | FVD 299.3 | 61 |
| Text-to-Video Generation | UCF-101 (zero-shot) | FVD 383.3 | 44 |
| Class-Conditional Video Generation | UCF-101 v1.0 (train test) | FVD 66.32 | 21 |
| Class-Conditional Video Generation | UCF-101 | gFVD 66 | 19 |
| Class-Conditional Video Generation | UCF-101 (test) | FVD 66.32 | 19 |
| Zero-Shot Video Generation | UCF-101 v1.0 (train test) | FVD 383.3 | 12 |