# OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework

## About
In this work, we pursue a unified paradigm for multimodal pretraining to break the scaffolds of complex task- and modality-specific customization. We propose OFA, a task-agnostic and modality-agnostic framework that supports task comprehensiveness. OFA unifies a diverse set of cross-modal and unimodal tasks, including image generation, visual grounding, image captioning, image classification, language modeling, etc., in a simple sequence-to-sequence learning framework. OFA follows instruction-based learning in both the pretraining and finetuning stages, requiring no extra task-specific layers for downstream tasks. In comparison with recent state-of-the-art vision-and-language models that rely on extremely large cross-modal datasets, OFA is pretrained on only 20M publicly available image-text pairs. Despite its simplicity and relatively small-scale training data, OFA achieves new state-of-the-art results on a series of cross-modal tasks while attaining highly competitive performance on unimodal tasks. Our further analysis indicates that OFA can also effectively transfer to unseen tasks and unseen domains. Our code and models are publicly available at https://github.com/OFA-Sys/OFA.
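The core idea, casting every task as text-to-text generation conditioned on an instruction (and optionally an image), can be sketched as plain data. This is a minimal illustration only: the prompt wordings, the `<loc_*>` coordinate tokens, and the `Seq2SeqExample` type are hypothetical stand-ins, not the exact strings or classes used by OFA.

```python
# Sketch of instruction-based seq2seq unification: every task (cross-modal or
# unimodal) becomes (instruction text, optional image) -> target token string.
# Prompt texts and location tokens below are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Seq2SeqExample:
    instruction: str         # the task is specified purely in natural language
    image: Optional[bytes]   # raw image input; None for text-only tasks
    target: str              # caption, label, or region, all as token strings

def make_examples():
    return [
        # cross-modal: captioning and visual grounding
        Seq2SeqExample("what does the image describe?", b"<img>",
                       "two dogs running on the grass"),
        Seq2SeqExample("which region does the text 'left dog' describe?", b"<img>",
                       "<loc_12> <loc_40> <loc_200> <loc_310>"),
        # unimodal: sentiment classification uses the exact same interface
        Seq2SeqExample("is the sentence 'a great movie' positive or negative?", None,
                       "positive"),
    ]

examples = make_examples()
# Because even region coordinates are emitted as tokens from one shared
# vocabulary, no task-specific output heads are needed at finetuning time.
assert all(isinstance(ex.target, str) for ex in examples)
```

Representing coordinates as discrete location tokens is what lets grounding share the same decoder as captioning and classification; the model never needs a regression head.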
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 146.7 | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 82 | 664 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 85.6 | 524 |
| Image Classification | Flowers-102 | Accuracy | 96.9 | 478 |
| Visual Question Answering | VQA v2 (test-std) | Accuracy | 82 | 466 |
| Natural Language Understanding | GLUE | SST-2 | 96.6 | 452 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 85.8 | 345 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 82 | 337 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 90.05 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 83.87 | 333 |