
Data-Efficient Multimodal Fusion on a Single GPU

About

The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources, making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower cost. We therefore propose FuseMix, a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment, we achieve competitive performance (and in certain cases outperform state-of-the-art methods) in both image-text and audio-text retrieval, with orders of magnitude less compute and data: for example, we outperform CLIP on the Flickr30K text-to-image retrieval task with roughly 600× fewer GPU days and roughly 80× fewer image-text pairs. Additionally, we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix.

Noël Vouitsis, Zhaoyan Liu, Satya Krishna Gorti, Valentin Villecroze, Jesse C. Cresswell, Guangwei Yu, Gabriel Loaiza-Ganem, Maksims Volkovs • 2023
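
The abstract describes FuseMix only as an augmentation applied in the latent spaces of pre-trained unimodal encoders. As a rough illustration, the sketch below assumes a mixup-style scheme: two paired samples are interpolated in each modality's latent space using one shared coefficient, so the mixed latents remain a valid pair. The function name `fusemix_pair`, the `alpha` parameter, and the Beta(alpha, alpha) sampling are assumptions made for this example, not details taken from the abstract; the linked repository is the reference for the actual method.

```python
import random


def fusemix_pair(img_i, img_j, txt_i, txt_j, alpha=1.0, rng=random):
    """Hypothetical mixup-style augmentation on pre-computed latents.

    img_i, img_j: latents of samples i and j from one frozen encoder.
    txt_i, txt_j: latents of the same two samples from a second frozen encoder.
    The two modalities may have different dimensionalities.
    """
    # Assumption: the coefficient is drawn from Beta(alpha, alpha), as in mixup.
    lam = rng.betavariate(alpha, alpha)

    def mix(u, v):
        # Convex combination of two latent vectors of equal length.
        return [lam * x + (1.0 - lam) * y for x, y in zip(u, v)]

    # The same lam is used for both modalities so the mixed pair stays aligned.
    return mix(img_i, img_j), mix(txt_i, txt_j), lam
```

Because the encoders stay frozen in this sketch, their latents could be computed once and cached, which is one plausible reading of how the approach keeps compute low.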

Related benchmarks

Task                     Dataset                   Metric    Result  Rank
Image-to-Text Retrieval  Flickr30K 1K (test)       R@1       85.2    439
Text-to-Image Retrieval  Flickr30K 1K (test)       R@1       71.2    375
Classification           Audiovision-MNIST (test)  Accuracy  75      41
Image-to-Text Retrieval  COCO 5K (test)            R@1       64.3    21
Text-to-Image Retrieval  COCO 5K (test)            R@1       46.3    16
Text-to-Audio Retrieval  AudioCaps 1K 1.0 (test)   Recall@1  43.1    10
Text-to-Audio Retrieval  Clotho 1K 1.0 (test)      R@1       17.6    10
UI Topic Classification  ENRICO (test)             Accuracy  80      9
Movie Genre Prediction   MM-IMDB (test)            Accuracy  64      9
Stock Market Prediction  STOCKS F&B (test)         Accuracy  0.54    9
Showing 10 of 14 rows
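
For readers unfamiliar with the retrieval metric reported above (R@1, also written Recall@1), the following is a minimal sketch of how Recall@K is commonly computed from a query-by-candidate similarity matrix. It assumes a simplified one-to-one setup in which the correct candidate for query q sits at index q; real benchmarks such as Flickr30K pair each image with several captions, so their evaluation scripts differ in detail. The function name `recall_at_k` is chosen for this example.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth candidate is in the top k.

    similarity[q][c] is the score of query q against candidate c.
    Assumption: the ground-truth candidate for query q is candidate q.
    """
    hits = 0
    for q, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)
```

With this definition, an R@1 of 71.2 would mean that the correct item was ranked first for 71.2% of the queries.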
