ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
About
Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer, eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
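To make two terms from the abstract concrete, here is a minimal sketch of the standard symmetric image-text contrastive loss used by dual-encoder models, together with one common way to measure the modality gap: the distance between the centroids of the normalized image and text embeddings. This is not ITO's implementation. The abstract does not specify the multiple-alignment objective or the fusion module, so neither is modeled here. The function names, the temperature of 0.07, and the centroid-distance definition of the gap are all assumptions chosen for illustration.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross_entropy_diag(logits):
    """Mean cross-entropy where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Standard image-text contrastive loss (assumed baseline, not ITO's).

    image_emb[i] and text_emb[i] are treated as a matched pair; every
    other pairing in the batch is a negative.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    # Cosine similarities scaled by the temperature (hypothetical value).
    i2t = [[dot(i, t) / temperature for t in txts] for i in imgs]
    t2i = [list(col) for col in zip(*i2t)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diag(i2t) + cross_entropy_diag(t2i))

def modality_gap(image_emb, text_emb):
    """Euclidean distance between image and text embedding centroids.

    An assumed definition for illustration; the paper may measure the
    gap differently.
    """
    imgs = [l2_normalize(v) for v in image_emb]
    txts = [l2_normalize(v) for v in text_emb]
    centroid_img = [sum(c) / len(imgs) for c in zip(*imgs)]
    centroid_txt = [sum(c) / len(txts) for c in zip(*txts)]
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(centroid_img, centroid_txt)))

if __name__ == "__main__":
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print("loss:", symmetric_contrastive_loss(images, texts))
    print("gap:", modality_gap(images, texts))
```

Under this definition, a gap near zero means the two modalities occupy the same region of the embedding space. Only the two encoders are needed at inference in a dual-encoder setup, which is why a module used solely during training adds no inference cost.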
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy: 76.2 | 2238 |
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Visual Question Answering | VizWiz | Accuracy: 44.25 | 1820 |
| Visual Question Answering | VQA v2 | Accuracy: 73.19 | 1429 |
| Visual Question Answering | GQA | Accuracy: 59.99 | 1425 |
| Multimodal Understanding | MMBench | -- | 847 |
| Image Classification | ImageNet V2 | Top-1 Acc: 66.3 | 749 |
| Image Classification | ImageNet A | Top-1 Acc: 45.7 | 698 |
| Image Classification | Stanford Cars | -- | 660 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 21.9 | 631 |