FLAVA: A Foundational Language And Vision Alignment Model
About
State-of-the-art vision and vision-and-language models rely on large-scale visio-linguistic pretraining to perform well on a variety of downstream tasks. Such models are usually either cross-modal (contrastive) or multi-modal (with earlier fusion), but not both, and they often target only specific modalities or tasks. A promising direction would be to use a single holistic universal model, as a "foundation", that targets all modalities at once: a true vision and language foundation model should be good at vision tasks, language tasks, and cross- and multi-modal vision and language tasks. We introduce FLAVA as such a model and demonstrate impressive performance on a wide range of 35 tasks spanning these target modalities.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 72.8 | 712 |
| Image Classification | Stanford Cars | Accuracy: 70.9 | 660 |
| Image Classification | ImageNet-1K | -- | 600 |
| Image Classification | DTD | Accuracy: 77.3 | 599 |
| Image Classification | Food-101 | Accuracy: 88.5 | 570 |
| Text-to-Image Retrieval | Flickr30K | R@1: 65.2 | 559 |
| Image Classification | Flowers102 | Accuracy: 98.1 | 558 |
| Natural Language Understanding | GLUE | SST-2: 90.9 | 551 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc): 90.9 | 529 |
| Text-to-Image Retrieval | Flickr30k (test) | Recall@1: 65.2 | 525 |