ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer

About

Autoregressive and diffusion models have achieved remarkable progress in language models and visual generation, respectively. We present ACDiT, a novel Autoregressive blockwise Conditional Diffusion Transformer that combines the autoregressive and diffusion paradigms for continuous visual information. By introducing a block-wise autoregressive unit, ACDiT offers a flexible interpolation between token-wise autoregression and full-sequence diffusion, bypassing the limitations of discrete tokenization. The generation of each block is formulated as a conditional diffusion process, conditioned on prior blocks. ACDiT is easy to implement: it only requires applying a specially designed Skip-Causal Attention Mask to the standard diffusion transformer during training. During inference, the process alternates between diffusion denoising and autoregressive decoding, which can make full use of the KV-Cache. We validate the effectiveness of ACDiT on image, video, and text generation and show that ACDiT performs best among all autoregressive baselines under similar model scales on visual generation tasks. We also demonstrate that, benefiting from autoregressive modeling, pretrained ACDiT can be transferred to visual understanding tasks despite being trained with the generative objective. The analysis of the trade-off between autoregression and diffusion demonstrates the potential of ACDiT to be used in long-horizon visual generation tasks. We hope that ACDiT offers a novel perspective on visual autoregressive generation and sheds light on new avenues for unified models.
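The abstract does not spell out the Skip-Causal Attention Mask, so the following is only a minimal sketch of one mask consistent with the description "each block is denoised conditioned on prior blocks". The sequence layout (all clean blocks first, then all noisy blocks), the function name `skip_causal_mask`, and the rule that clean blocks attend block-causally to each other are assumptions made for illustration, not the authors' implementation.

```python
def skip_causal_mask(num_blocks, block_size):
    """Return a 2-D list of booleans; mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): positions [0, n) hold the clean blocks,
    positions [n, 2n) hold the noisy copies of the same blocks,
    where n = num_blocks * block_size.
    """
    n = num_blocks * block_size
    total = 2 * n

    def describe(pos):
        # Returns (is_noisy, block_index) for a position in the assumed layout.
        return pos >= n, (pos % n) // block_size

    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        q_noisy, q_block = describe(q)
        for k in range(total):
            k_noisy, k_block = describe(k)
            if not q_noisy and not k_noisy:
                # Clean block attends to itself and earlier clean blocks.
                allowed = k_block <= q_block
            elif q_noisy and not k_noisy:
                # Noisy block i is conditioned on clean blocks strictly before i,
                # skipping its own clean copy.
                allowed = k_block < q_block
            elif q_noisy and k_noisy:
                # Noisy block attends only to itself, not to other noisy blocks.
                allowed = k_block == q_block
            else:
                # Clean blocks never see noisy blocks.
                allowed = False
            mask[q][k] = allowed
    return mask


mask = skip_causal_mask(num_blocks=3, block_size=2)
```

Under these assumptions, `block_size=1` gives a token-by-token conditioning pattern, while `num_blocks=1` leaves the single noisy block attending only to itself, which corresponds to the two ends of the interpolation the abstract describes.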

Jinyi Hu, Shengding Hu, Yuxuan Song, Yufei Huang, Mingxuan Wang, Hao Zhou, Zhiyuan Liu, Wei-Ying Ma, Maosong Sun • 2024

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | ImageNet (val) | Top-1 Acc: 84 | 1206
Image Generation | ImageNet 256x256 (test val) | FID: 2.37 | 35
Class-Conditional Video Generation | UCF101 | -- | 19
Video Generation | UCF-101 | FVD: 90 | 17
Class-to-video generation | UCF-101 | FVD: 90 | 13
Text Generation | OpenWebText (test) | -- | 8