
FILIP: Fine-grained Interactive Language-Image Pre-Training

About

Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either via the similarity of the global feature of each modality, which misses sufficient information, or via finer-grained interactions using cross/self-attention upon visual and textual tokens. However, cross/self-attention suffers from inferior efficiency in both training and inference. In this paper, we introduce large-scale Fine-grained Interactive Language-Image Pre-training (FILIP) to achieve finer-level alignment through a cross-modal late interaction mechanism, which uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP successfully leverages the finer-grained expressiveness between image patches and textual words by modifying only the contrastive loss, while simultaneously gaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks, including zero-shot image classification and image-text retrieval. The visualization of word-patch alignment further shows that FILIP can learn meaningful fine-grained features with promising localization ability.
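The late interaction mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the tensor shapes, the temperature value, and the choice to average the per-token maxima into a single image-text score are assumptions made for illustration, since the abstract only states that a token-wise maximum similarity guides the contrastive objective.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def late_interaction_similarity(image_tokens, text_tokens):
    """Token-wise maximum similarity between every image and every text.

    image_tokens: (B, P, D) patch embeddings for B images.
    text_tokens:  (B, T, D) word embeddings for B texts.
    Returns (i2t, t2i), both of shape (B, B), indexed [image, text].
    """
    img = l2_normalize(image_tokens)
    txt = l2_normalize(text_tokens)
    # token_sim[i, j, p, t]: cosine similarity of patch p in image i and word t in text j.
    token_sim = np.einsum("ipd,jtd->ijpt", img, txt)
    # Image-to-text: each patch keeps its best-matching word, then average over patches (assumed).
    i2t = token_sim.max(axis=3).mean(axis=2)
    # Text-to-image: each word keeps its best-matching patch, then average over words (assumed).
    t2i = token_sim.max(axis=2).mean(axis=2)
    return i2t, t2i


def log_softmax(x, axis):
    """Numerically stable log-softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def contrastive_loss(i2t, t2i, temperature=0.07):
    """Symmetric contrastive loss; matching pairs sit on the diagonal.

    The temperature value is a hypothetical default, not taken from the source.
    """
    idx = np.arange(i2t.shape[0])
    # For each image, classify the matching text among all texts in the batch.
    loss_i2t = -log_softmax(i2t / temperature, axis=1)[idx, idx].mean()
    # For each text, classify the matching image among all images in the batch.
    loss_t2i = -log_softmax(t2i / temperature, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(4, 9, 16))  # 4 images, 9 patches, 16-dim embeddings
    texts = rng.normal(size=(4, 5, 16))   # 4 texts, 5 words, 16-dim embeddings
    i2t, t2i = late_interaction_similarity(images, texts)
    print(i2t.shape, t2i.shape, contrastive_loss(i2t, t2i))
```

Because the two encoders only meet in the final similarity computation, the token embeddings for images and texts can be computed and cached independently, which is the offline pre-computation property the abstract refers to.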

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu • 2021

Related benchmarks

Task                     Dataset               Result               Rank
Image Classification     ImageNet-1k (val)     Top-1 Accuracy 39.5  840
Image Classification     CIFAR-100             Top-1 Accuracy 75.3  622
Image Classification     EuroSAT               --                   497
Image Classification     Food-101              Accuracy 43.1        494
Image Classification     DTD                   Accuracy 60.7        487
Image Classification     Stanford Cars         --                   477
Text-to-Image Retrieval  Flickr30K             R@1 75               460
Image-to-Text Retrieval  Flickr30K 1K (test)   R@1 96.6             439
Image Classification     ImageNet              Top-1 Accuracy 78.3  429
Image Classification     SUN397                Accuracy 50.7        425
Showing 10 of 68 rows
