Phased Consistency Models

About

Consistency Models (CMs) have made significant progress in accelerating the generation of diffusion models. However, their application to high-resolution, text-conditioned image generation in the latent space remains unsatisfactory. In this paper, we identify three key flaws in the current design of Latent Consistency Models (LCMs). We investigate the reasons behind these limitations and propose Phased Consistency Models (PCMs), which generalize the design space and address the identified limitations. Our evaluations demonstrate that PCMs outperform LCMs across 1- to 16-step generation settings. Although PCMs are designed specifically for multi-step refinement, their 1-step generation results are comparable to those of previous state-of-the-art methods designed specifically for 1-step generation. Furthermore, we show that the methodology of PCMs is versatile and applicable to video generation, enabling us to train the state-of-the-art few-step text-to-video generator. Our code is available at https://github.com/G-U-N/Phased-Consistency-Model.

Fu-Yun Wang, Zhaoyang Huang, Alexander William Bergman, Dazhong Shen, Peng Gao, Michael Lingelbach, Keqiang Sun, Weikang Bian, Guanglu Song, Yu Liu, Xiaogang Wang, Hongsheng Li • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	GenEval	GenEval Score 49.44	459
Text-to-Image Generation	MS-COCO (val)	FID 14.7	215
Text-to-Image Generation	COCO 2014 (val)	FID 11.7	69
Text-to-Image Generation	MS COCO zero-shot	FID 17.91	64
Text-to-Image Generation	Text-to-Image Generation	CLIP Score 0.2996	34
Text-to-Image Generation	GenEval (val)	GenEval Score 55	33
Image Generation	CC3M SDXL v1.0 (test)	FID 37.26	27
Aesthetic Evaluation	CC3M SDXL 1.0 (test)	HPS 0.2731	27
Image-to-Video Generation	VBench I2V	Background Consistency 97.34	24
Text-to-Image Generation	COCO 5k	CLIP Score 0.3242	19

Showing 10 of 14 rows
