
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

About

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
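For readers unfamiliar with the baseline the abstract builds on, the following is a minimal sketch of a standard symmetric image-text contrastive (InfoNCE-style) loss for a dual encoder, together with one common way to quantify a modality gap (the distance between the centroids of the normalized image and text embeddings). It does not implement ITO's multiple alignment or its training-time fusion module, whose details are not given here. The function names, the temperature value of 0.07, and the centroid-distance definition of the gap are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along `axis`."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each input is a matched pair.

    The temperature is an illustrative assumption, not a value from the paper.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature          # (batch, batch) cosine similarities
    idx = np.arange(logits.shape[0])            # matched pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def modality_gap(image_emb, text_emb):
    """Euclidean distance between centroids of the normalized embeddings.

    One common definition of the gap; assumed here, not taken from the paper.
    """
    img_center = l2_normalize(image_emb).mean(axis=0)
    txt_center = l2_normalize(text_emb).mean(axis=0)
    return float(np.linalg.norm(img_center - txt_center))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))                  # stand-in image embeddings
    texts = images + 0.1 * rng.normal(size=(8, 16))    # stand-in text embeddings
    print("loss:", symmetric_contrastive_loss(images, texts))
    print("gap:", modality_gap(images, texts))
```

Under this definition, a gap near zero means the two modalities occupy the same region of the embedding space, which is the property the abstract attributes to training-time fusion.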

Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy: 76.2 | 2238 |
| Object Hallucination Evaluation | POPE | -- | 2019 |
| Visual Question Answering | VizWiz | Accuracy: 44.25 | 1820 |
| Visual Question Answering | VQA v2 | Accuracy: 73.19 | 1429 |
| Visual Question Answering | GQA | Accuracy: 59.99 | 1425 |
| Multimodal Understanding | MMBench | -- | 847 |
| Image Classification | ImageNet V2 | Top-1 Acc: 66.3 | 749 |
| Image Classification | ImageNet A | Top-1 Acc: 45.7 | 698 |
| Image Classification | Stanford Cars | -- | 660 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 21.9 | 631 |
Showing 10 of 74 rows
