Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens

About

Contrastive learning-based vision-language pre-training approaches, such as CLIP, have demonstrated great success in many vision-language tasks. These methods achieve cross-modal alignment by encoding a matched image-text pair with similar feature embeddings, which are generated by aggregating information from visual patches and language tokens. However, directly aligning cross-modal information using such representations is challenging, as visual patches and text tokens differ in semantic level and granularity. To alleviate this issue, we propose a Finite Discrete Tokens (FDT) based multimodal representation. FDT is a set of learnable tokens, each representing a certain visual-semantic concept. Both images and texts are embedded using the shared FDT by first grounding the multimodal inputs to the FDT space and then aggregating the activated FDT representations. A sparse activation constraint enforces that matched visual and semantic concepts are represented by the same set of discrete tokens. As a result, the granularity gap between the two modalities is reduced. Through both quantitative and qualitative analyses, we demonstrate that using FDT representations in CLIP-style models improves cross-modal alignment and performance in visual recognition and vision-language downstream tasks. Furthermore, we show that our method can learn more comprehensive representations, and that the learned FDT capture meaningful cross-modal correspondence, ranging from objects to actions and attributes.
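
The grounding-then-aggregation step described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the inner-product relevance score, the max-pooling over patches or tokens, the use of sparsemax as the sparse activation, and every name below (`sparsemax`, `fdt_embed`, the array sizes) are assumptions chosen for illustration.

```python
import numpy as np

def sparsemax(z):
    """Project a score vector onto the probability simplex (Martins & Astudillo, 2016).
    Unlike softmax, the result can contain exact zeros."""
    z_sorted = np.sort(z)[::-1]                 # scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1 + k * z_sorted > cumsum         # entries that stay non-zero
    k_max = k[support][-1]                      # size of the support
    tau = (cumsum[k_max - 1] - 1) / k_max       # threshold subtracted from all scores
    return np.maximum(z - tau, 0.0)

def fdt_embed(features, fdt):
    """Embed one image or one text with a shared FDT codebook (hypothetical helper).

    features: (n, d) patch or token features from a modality encoder
    fdt:      (C, d) learnable discrete tokens shared by both modalities
    """
    scores = features @ fdt.T                   # (n, C) relevance of each FDT to each patch/token
    relevance = scores.max(axis=0)              # (C,) assumed grounding: max over patches/tokens
    weights = sparsemax(relevance)              # sparse activation over the C tokens
    return weights @ fdt, weights               # weighted sum of the activated FDT

# Toy data standing in for encoder outputs; the sizes are arbitrary.
rng = np.random.default_rng(0)
fdt = rng.normal(size=(16, 8))
patches = rng.normal(size=(5, 8))               # image side
tokens = rng.normal(size=(3, 8))                # text side

img_emb, img_w = fdt_embed(patches, fdt)
txt_emb, txt_w = fdt_embed(tokens, fdt)
print(img_emb.shape, np.count_nonzero(img_w), np.count_nonzero(txt_w))
```

Because both modalities are expressed as weighted sums over the same codebook, a CLIP-style contrastive loss on `img_emb` and `txt_emb` would push matched pairs toward activating the same tokens, which is the intuition the abstract gives for the reduced granularity gap.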

Yuxiao Chen, Jianbo Yuan, Yu Tian, Shijie Geng, Xinyu Li, Ding Zhou, Dimitris N. Metaxas, Hongxia Yang • 2023

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-1K	Top-1 Acc 34.2	1239
Image Classification	ImageNet A	Top-1 Acc 5.2	723
Image Classification	CIFAR-100	Accuracy 36	691
Image Classification	ImageNet-R	Top-1 Acc 48.8	622
Image Classification	Food-101	Accuracy 18	590
Image Classification	Flowers102	Accuracy 24.8	558
Image Classification	RESISC45	Accuracy 14	539
Text-to-Image Retrieval	Flickr30k (test)	--	528
Image-to-Text Retrieval	Flickr30k (test)	--	472
Image Classification	SUN397	Accuracy 25	425

Showing 10 of 44 rows
