CSA: Data-efficient Mapping of Unimodal Features to Multimodal Features

About

Multimodal encoders like CLIP excel at tasks such as zero-shot image classification and cross-modal retrieval. However, they require excessive amounts of training data. We propose canonical similarity analysis (CSA), which uses two unimodal encoders to replicate a multimodal encoder using limited data. CSA maps unimodal features into a multimodal space and uses a new similarity score to retain only the multimodal information. CSA involves only inference with the unimodal encoders and a cubic-complexity matrix decomposition, eliminating the need for extensive GPU-based model training. Given pre-trained unimodal encoders, experiments on ImageNet classification and misinformative news caption detection show that CSA outperforms CLIP while requiring $50,000\times$ fewer multimodal data pairs to bridge the modalities. CSA also surpasses the state-of-the-art method for mapping unimodal features to multimodal features. We further demonstrate CSA on modalities beyond image and text, paving the way for future modality pairs with limited paired multimodal data but abundant unpaired unimodal data, such as lidar and text.
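
The abstract does not spell out the decomposition or the similarity score, so the following is only a rough illustration of the general idea, not the authors' implementation. It assumes classical canonical correlation analysis (CCA), computed through an SVD of the whitened cross-covariance, and a cosine-style score weighted by the canonical correlations over the top `s` dimensions. The function names, the regularizer `reg`, and the synthetic features (which stand in for the outputs of two unimodal encoders) are all hypothetical.

```python
import numpy as np


def inv_sqrt(mat, eps=1e-6):
    """Inverse square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.maximum(vals, eps)  # guard against tiny or negative eigenvalues
    return (vecs / np.sqrt(vals)) @ vecs.T


def fit_cca(x, y, reg=1e-3):
    """Classical CCA on paired features x (n, dx) and y (n, dy).

    Assumption: a ridge term `reg` is added to both covariances for stability.
    Returns the feature means, the two projection matrices, and the
    canonical correlations sorted in descending order.
    """
    n = x.shape[0]
    mx, my = x.mean(axis=0), y.mean(axis=0)
    xc, yc = x - mx, y - my
    cxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    cyy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    cxy = xc.T @ yc / (n - 1)
    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    # The eigendecompositions and this SVD are the cubic-cost steps.
    u, rho, vt = np.linalg.svd(wx @ cxy @ wy, full_matrices=False)
    return mx, my, wx @ u, wy @ vt.T, rho


def weighted_similarity(zx, zy, rho, s):
    """Hypothetical score: cosine similarity over the top-s canonical
    dimensions, with each dimension weighted by its canonical correlation."""
    zx, zy, w = zx[:s], zy[:s], rho[:s]
    denom = np.linalg.norm(zx) * np.linalg.norm(zy) + 1e-12
    return float(np.sum(w * zx * zy) / denom)


# Synthetic stand-ins for unimodal features that share a k-dimensional signal.
rng = np.random.default_rng(0)
n, k = 500, 4
latent = rng.normal(size=(n, k))
x = latent @ rng.normal(size=(k, 16)) + 0.1 * rng.normal(size=(n, 16))
y = latent @ rng.normal(size=(k, 12)) + 0.1 * rng.normal(size=(n, 12))

mx, my, a, b, rho = fit_cca(x, y)
zx = (x - mx) @ a  # first modality projected into the shared space
zy = (y - my) @ b  # second modality projected into the shared space

print("canonical correlations:", np.round(rho, 3))
print("matched pair score:", weighted_similarity(zx[0], zy[0], rho, s=k))
print("mismatched pair score:", weighted_similarity(zx[0], zy[1], rho, s=k))
```

In this sketch, truncating to the top `s` dimensions is what discards directions that the two feature sets do not share. In practice `s` would be a tuning choice rather than a known quantity as it is for the synthetic data here.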

Po-han Li, Sandeep P. Chinchali, Ufuk Topcu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Image Classification | CIFAR-10 | Accuracy 85.39 | 875 |
| Text-to-Image Retrieval | Flickr30K | R@1 25.44 | 559 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1 30.92 | 525 |
| Classification | Cars | Accuracy 1.42 | 492 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 44 | 472 |
| Image-to-Text Retrieval | Flickr30K | R@1 34.8 | 451 |
| Image Classification | CIFAR-100 | Accuracy 40.8 | 357 |
| Image Classification | Pets | Accuracy 8.45 | 308 |
| Image Classification | Food101 | Accuracy 29.82 | 177 |
| Text-to-Image Retrieval | COCO | -- | 156 |
Showing 10 of 27 rows
