ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion

About

Image-text contrastive pretraining has become a dominant paradigm for visual representation learning, yet existing methods often yield representations that remain partially organized by modality. We propose ITO, a framework addressing this limitation through two synergistic mechanisms. Multimodal multiple alignment enriches supervision by mining diverse image-text correspondences, while a lightweight training-time multimodal fusion module enforces structured cross-modal interaction. Crucially, the fusion module is discarded at inference, preserving the efficiency of standard dual-encoder architectures. Extensive experiments show that ITO consistently outperforms strong baselines across classification, retrieval, and multimodal benchmarks. Our analysis reveals that while multiple alignment drives discriminative power, training-time fusion acts as a critical structural regularizer, eliminating the modality gap and stabilizing training dynamics to prevent the early saturation often observed in aggressive contrastive learning.
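
For reference, here is a minimal sketch of the baseline the abstract builds on: the standard symmetric image-text contrastive loss, plus one common way to quantify a modality gap (the distance between the centroids of the normalized image and text embeddings). This is not ITO's multiple-alignment objective or its fusion module, whose details are not given on this page. The function names, the temperature of 0.07, and the toy vectors are illustrative assumptions.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_row(logits, target):
    # Numerically stable -log softmax(logits)[target].
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Generic CLIP-style loss (assumed baseline, not the ITO objective):
    # the k-th image and k-th text form the only positive pair in the batch.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    # Image-to-text direction: each row of the similarity matrix.
    i2t = sum(cross_entropy_row(sim[k], k) for k in range(n)) / n
    # Text-to-image direction: each column of the similarity matrix.
    t2i = sum(cross_entropy_row([sim[r][k] for r in range(n)], k)
              for k in range(n)) / n
    return 0.5 * (i2t + t2i)

def modality_gap(image_embs, text_embs):
    # One common proxy: Euclidean distance between the per-modality
    # centroids of the unit-normalized embeddings.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    dim = len(imgs[0])
    img_center = [sum(v[d] for v in imgs) / len(imgs) for d in range(dim)]
    txt_center = [sum(v[d] for v in txts) / len(txts) for d in range(dim)]
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(img_center, txt_center)))

# Toy batch (hypothetical values): two perfectly aligned image-text pairs.
images = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(images, texts)
shuffled_loss = symmetric_contrastive_loss(images, texts[::-1])
gap = modality_gap(images, texts)
```

On this toy batch the aligned pairs give a loss close to zero, swapping the captions raises it sharply, and the gap is zero because both modalities share the same centroid. Only the two encoders' embeddings enter these computations, which matches the dual-encoder inference path the abstract says is preserved once the fusion module is discarded.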

Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zonglin Zhao, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-1K 1.0 (val)	Top-1 Accuracy 76.2	2238
Object Hallucination Evaluation	POPE	--	2019
Visual Question Answering	VizWiz	Accuracy 44.25	1820
Visual Question Answering	VQA v2	Accuracy 73.19	1429
Visual Question Answering	GQA	Accuracy 59.99	1425
Multimodal Understanding	MMBench	--	847
Image Classification	ImageNet V2	Top-1 Acc 66.3	749
Image Classification	ImageNet A	Top-1 Acc 45.7	698
Image Classification	Stanford Cars	--	660
Multimodal Understanding	MM-Vet	MM-Vet Score 21.9	631

Showing 10 of 74 rows
