Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

About

Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and throughput. DREAM outperforms its single-objective baselines, CLIP and FLUID: on ImageNet linear-probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth estimation (+6.25%) over CLIP, and on CC12M FID (+6.2%) over FLUID while maintaining CLIP Score. Together, these gains show that text-image contrastive and generative objectives, when properly unified, are synergistic rather than competing.
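The two mechanisms named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the shape of the masking distribution, the direction or endpoints of the shift, or the scoring function, so the truncated Gaussian, the linear ramp from 0.1 to 0.8, the standard deviation of 0.25, and the cosine-similarity scoring are all assumptions made for illustration. The function names are hypothetical as well.

```python
import math
import random


def mask_ratio_center(step, total_steps, start=0.1, end=0.8):
    """Center of the masking distribution at a given training step.

    Assumption: the center moves linearly from `start` to `end`.
    The source only says the center shifts over training.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_mask_ratio(step, total_steps, std=0.25, rng=random):
    """Draw a masking ratio in [0, 1] around the current center.

    Rejection sampling yields a truncated Gaussian (an assumed shape).
    Because the distribution is wide, both low and high ratios can be
    drawn at any step; only their relative frequency changes.
    """
    center = mask_ratio_center(step, total_steps)
    while True:
        ratio = rng.gauss(center, std)
        if 0.0 <= ratio <= 1.0:
            return ratio


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_trajectory(text_emb, partial_image_embs):
    """Index of the candidate whose partial-image embedding best matches the text.

    Assumption: candidates are ranked by cosine similarity between the
    text embedding and the embedding of each partially decoded image.
    """
    scores = [cosine(text_emb, emb) for emb in partial_image_embs]
    return max(range(len(scores)), key=scores.__getitem__)
```

In a training loop, `sample_mask_ratio(step, total_steps)` would be called once per batch or per sample. At inference, `select_trajectory` would be applied after a small fraction of tokens has been decoded for each candidate. For a hypothetical 256-token image, the 12.5% figure from the abstract corresponds to 32 tokens.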

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Jianpeng Cheng, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | ADE20K (val) | mIoU 36.8 | 3069 |
| Image Classification | ImageNet-1K | Top-1 Acc 82.7 | 1239 |
| Image Classification | ImageNet A | Top-1 Acc 32.8 | 698 |
| Image Classification | RESISC45 | Accuracy 93.4 | 472 |
| Depth Estimation | NYU v2 (test) | -- | 435 |
| Image Classification | ObjectNet | -- | 251 |
| Image Classification | ImageNet-R | Accuracy 55.3 | 217 |
| Image Classification | ImageNet-1k (val) | Accuracy 82.7 | 199 |
| Text-to-Image Generation | MS-COCO | FID 10.4 | 145 |
| Image Classification | ImageNet-S | Top-1 Acc 42 | 92 |

Showing 10 of 28 rows
