
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

About

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer -- eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
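The abstract's core design, a standard dual-encoder contrastive objective plus a lightweight fusion module that participates only in training, can be illustrated with a minimal sketch. This is not the authors' implementation: the module sizes, the binary matched/mismatched fusion objective, and all class names (`DualEncoder`, `TrainingTimeFusion`) are hypothetical stand-ins chosen for brevity; the paper's "multiple alignment" mining is not modeled here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    """Toy stand-in for an image or text encoder (hypothetical sizes)."""
    def __init__(self, in_dim, embed_dim=64):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim))

    def forward(self, x):
        # L2-normalized embeddings, as in standard contrastive pretraining.
        return F.normalize(self.proj(x), dim=-1)

class TrainingTimeFusion(nn.Module):
    """Lightweight cross-modal head used ONLY during training, then discarded."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, 1))

    def forward(self, img_emb, txt_emb):
        # Score whether an (image, text) pair matches from the fused embedding.
        return self.mlp(torch.cat([img_emb, txt_emb], dim=-1)).squeeze(-1)

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE over the in-batch similarity matrix.
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(img_emb.size(0))
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

# --- Training step: contrastive alignment + fusion matching loss ---
img_enc, txt_enc = DualEncoder(512), DualEncoder(300)
fusion = TrainingTimeFusion()
images, texts = torch.randn(8, 512), torch.randn(8, 300)
img_emb, txt_emb = img_enc(images), txt_enc(texts)

match = fusion(img_emb, txt_emb)                      # aligned pairs
mismatch = fusion(img_emb, txt_emb.roll(1, dims=0))   # shuffled negatives
fusion_loss = F.binary_cross_entropy_with_logits(
    torch.cat([match, mismatch]),
    torch.cat([torch.ones(8), torch.zeros(8)]))
loss = contrastive_loss(img_emb, txt_emb) + fusion_loss

# --- Inference: the fusion head is dropped; retrieval uses the dual
# encoders alone, preserving standard dual-encoder efficiency. ---
with torch.no_grad():
    sims = img_enc(images) @ txt_enc(texts).t()
```

Because the fusion head only shapes the encoders through its gradient during training, inference cost is identical to a plain CLIP-style dual encoder.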

Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He • 2026

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy 76.2 | 1952
Visual Question Answering | VizWiz | Accuracy 44.25 | 1525
Object Hallucination Evaluation | POPE | -- | 1455
Visual Question Answering | VQA v2 | Accuracy 73.19 | 1362
Visual Question Answering | GQA | Accuracy 59.99 | 1249
Image Classification | ImageNet A | Top-1 Acc 45.7 | 654
Multimodal Understanding | MMBench | -- | 637
Image Classification | Stanford Cars | -- | 635
Image Classification | ImageNet V2 | Top-1 Acc 66.3 | 611
Image Classification | EuroSAT | -- | 569

Showing 10 of 74 rows
