Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation

About

Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and throughput. DREAM outperforms its single-objective baselines, CLIP and FLUID: on ImageNet linear-probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth estimation (+6.25%) over CLIP, and on CC12M FID (+6.2%) over FLUID while maintaining CLIP Score. Together, these gains show that text-image contrastive and generative objectives, when properly unified, are synergistic rather than competing.
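As a rough illustration of the Masking Warmup idea described above, the sketch below samples a masking ratio from a truncated Gaussian whose center moves over training. It is not the authors' implementation. The linear schedule, the low-to-high direction, the start and end centers (0.1 and 0.9), the standard deviation (0.25), and the function names are all assumptions chosen for illustration. Because the distribution is wide, both low and high ratios keep being drawn at every step while the center shifts.

```python
import random


def mask_ratio_center(step, total_steps, start=0.1, end=0.9):
    """Center of the masking distribution at a given training step.

    Assumption: a linear shift from `start` to `end`; the source only says
    that the center moves over training.
    """
    t = min(max(step / total_steps, 0.0), 1.0)
    return start + t * (end - start)


def sample_mask_ratio(step, total_steps, std=0.25, lo=0.0, hi=1.0, rng=random):
    """Draw one masking ratio from a Gaussian around the scheduled center.

    The Gaussian is truncated to [lo, hi] by rejection sampling. `std`, `lo`
    and `hi` are hypothetical values, not taken from the paper.
    """
    center = mask_ratio_center(step, total_steps)
    while True:
        ratio = rng.gauss(center, std)
        if lo <= ratio <= hi:
            return ratio


if __name__ == "__main__":
    rng = random.Random(0)
    total = 1000
    for step in (0, 500, 1000):
        ratios = [sample_mask_ratio(step, total, rng=rng) for _ in range(2000)]
        mean = sum(ratios) / len(ratios)
        print(f"step={step} mean={mean:.2f} "
              f"min={min(ratios):.2f} max={max(ratios):.2f}")
```

In a training loop, the sampled ratio would decide what fraction of image tokens to hide for that batch; how the ratio is shared between the contrastive and generative losses is not specified in the text above and is left open here.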

Chao Li, Tianhong Li, Sai Vidyaranya Nuthalapati, Hong-You Chen, Satya Narayan Shukla, Jianpeng Cheng, Yonghuan Yang, Jun Xiao, Xiangjun Fan, Aashu Singh, Dina Katabi, Shlok Kumar Mishra • 2026

Related benchmarks

Task	Dataset	Result
Semantic Segmentation	ADE20K (val)	mIoU 36.8	3089
Image Classification	ImageNet-1K	Top-1 Acc 82.7	1239
Image Classification	ImageNet A	Top-1 Acc 32.8	723
Image Classification	RESISC45	Accuracy 93.4	539
Depth Estimation	NYU v2 (test)	--	438
Image Classification	ObjectNet	--	251
Image Classification	ImageNet-R	Accuracy 55.3	217
Image Classification	ImageNet-1k (val)	Accuracy 82.7	199
Text-to-Image Generation	MS-COCO	FID 10.4	193
Image Classification	ImageNet-S	Top-1 Acc 42	92

Showing 10 of 28 rows
