Unifying Contrastive and Generative Objectives for Visual Understanding and Text-to-Image Generation
About
Unifying text-image contrastive learning and text-to-image (T2I) generation in a single end-to-end model is challenging because the two objectives demand opposing masking regimes: contrastive alignment needs near-complete visible tokens, while masked generative modeling needs heavy corruption. We introduce DREAM, a unified framework that resolves this conflict through Masking Warmup, a schedule that shifts the center of the masking distribution over training, so low and high masking ratios coexist at every step. This co-exposure lets a single jointly-trained encoder serve both objectives. The resulting stable optimization unlocks Semantically Aligned Decoding at inference: the text encoder, trained against visual embeddings at all masking ratios, can score partially generated images and select the best trajectory with as little as 12.5% of the image decoded, improving both FID and throughput. DREAM outperforms its single-objective baselines, CLIP and FLUID: on ImageNet linear-probing (+1.1%), 5-shot transfer (+4.1%), ADE20K segmentation (+1.9%), and NYU depth estimation (+6.25%) over CLIP, and on CC12M FID (+6.2%) over FLUID while maintaining CLIP Score. Together, these gains show that text-image contrastive and generative objectives, when properly unified, are synergistic rather than competing.
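As a rough illustration of the Masking Warmup idea described above, the sketch below samples a per-step masking ratio from a distribution whose center moves during training while its spread keeps both low and high ratios reachable at every step. The abstract does not specify the distribution, the direction of the shift, or any hyperparameters. The truncated Gaussian, the low-to-high linear shift, and the names `mask_center`, `sample_mask_ratio`, `start`, `end`, and `std` are all assumptions made for illustration, not the authors' implementation.

```python
import random


def mask_center(step, total_steps, start=0.1, end=0.75):
    """Center of the masking distribution at a given training step.

    Assumption: the center moves linearly from `start` to `end`.
    Both endpoints are hypothetical values, not taken from the paper.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def sample_mask_ratio(step, total_steps, std=0.25, rng=random):
    """Draw one masking ratio in [0, 1] around the current center.

    Assumption: a Gaussian truncated to [0, 1] by rejection sampling.
    The wide `std` keeps low and high ratios possible at every step.
    """
    center = mask_center(step, total_steps)
    while True:
        ratio = rng.gauss(center, std)
        if 0.0 <= ratio <= 1.0:
            return ratio


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 5000, 10000):
        draws = [sample_mask_ratio(step, 10000, rng=rng) for _ in range(5)]
        print(step, [round(d, 2) for d in draws])
```

Under these assumptions, early steps mostly draw low ratios (favoring contrastive alignment) and late steps mostly draw high ratios (favoring masked generation), yet any single step can still produce a ratio from the other regime.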
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 36.8 | 3069 |
| Image Classification | ImageNet-1K | Top-1 Acc 82.7 | 1239 |
| Image Classification | ImageNet A | Top-1 Acc 32.8 | 698 |
| Image Classification | RESISC45 | Accuracy 93.4 | 472 |
| Depth Estimation | NYU v2 (test) | -- | 435 |
| Image Classification | ObjectNet | -- | 251 |
| Image Classification | ImageNet-R | Accuracy 55.3 | 217 |
| Image Classification | ImageNet-1k (val) | Accuracy 82.7 | 199 |
| Text-to-Image Generation | MS-COCO | FID 10.4 | 145 |
| Image Classification | ImageNet-S | Top-1 Acc 42 | 92 |