FLAVA: A Foundational Language And Vision Alignment Model
About
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining for obtaining good performance on a variety of downstream tasks. Such models are generally either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction is a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
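The sketch below illustrates the "all modalities at once" idea: FLAVA exposes a vision encoder, a language encoder, and a multimodal fusion encoder from a single pretrained model. This is a minimal usage sketch, not code from the paper; it assumes the Hugging Face `transformers` FLAVA integration and the publicly released `facebook/flava-full` checkpoint, and the example image URL is purely illustrative.

```python
# Minimal sketch: query FLAVA for unimodal and multimodal embeddings.
# Assumes the Hugging Face `transformers` FLAVA classes and the
# `facebook/flava-full` checkpoint; not code from the FLAVA paper itself.
import requests
from PIL import Image
from transformers import FlavaProcessor, FlavaModel

processor = FlavaProcessor.from_pretrained("facebook/flava-full")
model = FlavaModel.from_pretrained("facebook/flava-full")

image = Image.open(requests.get(
    "http://images.cocodataset.org/val2017/000000039769.jpg", stream=True).raw)
inputs = processor(text=["a photo of two cats"], images=image,
                   return_tensors="pt", padding=True)

outputs = model(**inputs)
print(outputs.image_embeddings.shape)       # vision encoder output
print(outputs.text_embeddings.shape)        # language encoder output
print(outputs.multimodal_embeddings.shape)  # fused multimodal encoder output
```

The unimodal embeddings serve vision-only and language-only tasks, while the fused embeddings back the cross- and multi-modal tasks listed in the benchmark table below.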
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 72.8 | 664 |
| Image Classification | ImageNet-1K | -- | -- | 524 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 90.9 | 504 |
| Image Classification | Food-101 | Accuracy | 88.5 | 494 |
| Image Classification | DTD | Accuracy | 77.3 | 487 |
| Image Classification | Flowers102 | Accuracy | 98.1 | 478 |
| Image Classification | Stanford Cars | Accuracy | 70.9 | 477 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 65.2 | 460 |
| Natural Language Understanding | GLUE | SST-2 | 90.9 | 452 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 67.7 | 439 |