
MixFlow Training: Alleviating Exposure Bias with Slowed Interpolation Mixture

About

This paper studies the training-testing discrepancy (a.k.a. exposure bias) problem for improving diffusion models. During training, the input to the prediction network at a given training timestep is the corresponding ground-truth noisy data, an interpolation of the noise and the data; during testing, the input is the generated noisy data. We present a novel training approach, named MixFlow, for improving performance. Our approach is motivated by the Slow Flow phenomenon: the ground-truth interpolation nearest to the generated noisy data at a given sampling timestep is observed to correspond to a higher-noise timestep (termed the slowed timestep), i.e., the corresponding ground-truth timestep is slower than the sampling timestep. MixFlow leverages the interpolations at the slowed timesteps, named the slowed interpolation mixture, to post-train the prediction network at each training timestep. Experiments on class-conditional image generation (including SiT, REPA, and RAE) and text-to-image generation validate the effectiveness of our approach. Applied to RAE models, MixFlow achieves strong generation results on ImageNet: 1.43 FID (without guidance) and 1.10 (with guidance) at 256 x 256, and 1.55 FID (without guidance) and 1.10 (with guidance) at 512 x 512.
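The Slow Flow phenomenon described above can be sketched in a few lines of NumPy. This is a hedged illustration, not the authors' implementation: it assumes the standard linear (rectified-flow style) interpolation between data and noise, and the function and variable names (`interpolation`, `slowed_timestep`) are hypothetical.

```python
import numpy as np

def interpolation(x0, eps, t):
    # Linear interpolation between clean data x0 and noise eps:
    # t = 0 gives the data, t = 1 gives pure noise.
    return (1.0 - t) * x0 + t * eps

def slowed_timestep(x_gen, x0, eps, grid):
    # Return the timestep on `grid` whose ground-truth interpolation is
    # nearest (in L2 distance) to the generated sample x_gen. The paper
    # observes this "slowed" timestep tends to be noisier (larger t)
    # than the sampling timestep that produced x_gen.
    dists = [np.linalg.norm(interpolation(x0, eps, t) - x_gen) for t in grid]
    return grid[int(np.argmin(dists))]
```

MixFlow then uses the interpolations at these slowed timesteps (the slowed interpolation mixture) as post-training inputs, so the network sees inputs closer to what it will actually encounter at sampling time.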

Hui Li, Jiayue Lyu, Fu-Yun Wang, Kaihui Cheng, Siyu Zhu, Jingdong Wang • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | -- | 178 |
| Class-conditional Image Generation | ImageNet | -- | 132 |
| Text-to-Image Generation | T2I-CompBench | Shape Fidelity: 69.12 | 94 |
| Text-to-Image Generation | GenEval | -- | 87 |
| Class-conditional Image Generation | ImageNet 512x512 (val) | -- | 69 |
| Class-conditional Image Generation | ImageNet 256 x 256 1k (val) | -- | 67 |
| Text-to-Image Generation | DPG-Bench | Average Score: 86.16 | 24 |
