FILIP: Fine-grained Interactive Language-Image Pre-Training
About
Unsupervised large-scale vision-language pre-training has shown promising advances on various downstream tasks. Existing methods often model the cross-modal interaction either through the similarity of each modality's global feature, which lacks sufficient information, or through finer-grained interactions using cross/self-attention over visual and textual tokens. However, cross/self-attention is inefficient in both training and inference. In this paper, we introduce large-scale Fine-grained Interactive Language-Image Pre-training (FILIP), which achieves finer-level alignment through a cross-modal late interaction mechanism that uses a token-wise maximum similarity between visual and textual tokens to guide the contrastive objective. FILIP leverages the finer-grained expressiveness between image patches and textual words by modifying only the contrastive loss, while retaining the ability to pre-compute image and text representations offline at inference, keeping both large-scale training and inference efficient. Furthermore, we construct a new large-scale image-text pair dataset called FILIP300M for pre-training. Experiments show that FILIP achieves state-of-the-art performance on multiple downstream vision-language tasks, including zero-shot image classification and image-text retrieval. Visualization of word-patch alignment further shows that FILIP can learn meaningful fine-grained features with promising localization ability.
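The late interaction mechanism described in the abstract can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the tensor shapes, the temperature value, and the choice to average the per-token maxima are assumptions made for illustration, and padding tokens are ignored for brevity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def late_interaction_similarity(image_tokens, text_tokens):
    """Token-wise maximum similarity between every image and every text.

    image_tokens: (B, P, D) patch embeddings for B images (hypothetical shape).
    text_tokens:  (B, T, D) word embeddings for B texts (hypothetical shape).
    Returns two (B, B) matrices indexed as [image, text].
    """
    img = l2_normalize(image_tokens)
    txt = l2_normalize(text_tokens)
    # token_sim[i, j, p, t]: similarity of patch p in image i and word t in text j.
    token_sim = np.einsum("ipd,jtd->ijpt", img, txt)
    # Image-to-text: each patch keeps its best-matching word, then average over patches.
    i2t = token_sim.max(axis=3).mean(axis=2)
    # Text-to-image: each word keeps its best-matching patch, then average over words.
    t2i = token_sim.max(axis=2).mean(axis=2)
    return i2t, t2i

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(i2t, t2i, temperature=0.07):
    # Matched image-text pairs sit on the diagonal; temperature is an assumed value.
    idx = np.arange(i2t.shape[0])
    loss_i2t = -log_softmax(i2t / temperature, axis=1)[idx, idx].mean()
    loss_t2i = -log_softmax(t2i / temperature, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Usage with random stand-in embeddings: 4 pairs, 9 patches, 6 words, 16 dimensions.
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(4, 9, 16))
text_tokens = rng.normal(size=(4, 6, 16))
i2t, t2i = late_interaction_similarity(image_tokens, text_tokens)
loss = contrastive_loss(i2t, t2i)
```

Because the two encoders interact only through this final similarity step, the token embeddings of each image and text can be computed once and cached, which is what allows offline pre-computation at inference.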
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 39.5 | 840 |
| Image Classification | CIFAR-100 | Top-1 Accuracy | 75.3 | 622 |
| Image Classification | EuroSAT | -- | -- | 497 |
| Image Classification | Food-101 | Accuracy | 43.1 | 494 |
| Image Classification | DTD | Accuracy | 60.7 | 487 |
| Image Classification | Stanford Cars | -- | -- | 477 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 75 | 460 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 96.6 | 439 |
| Image Classification | ImageNet | Top-1 Accuracy | 78.3 | 429 |
| Image Classification | SUN397 | Accuracy | 50.7 | 425 |