Generative Multimodal Models are In-Context Learners
About
The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate. In this work, we demonstrate that the task-agnostic in-context learning capabilities of large multimodal models can be significantly enhanced by effective scaling-up. We introduce Emu2, a generative multimodal model with 37 billion parameters, trained on large-scale multimodal sequences with a unified autoregressive objective. Emu2 exhibits strong multimodal in-context learning abilities, including emergent abilities to solve tasks that require on-the-fly reasoning, such as visual prompting and object-grounded generation. The model sets a new record on multiple multimodal understanding tasks in few-shot settings. When instruction-tuned to follow specific instructions, Emu2 further achieves a new state of the art on challenging tasks such as question answering benchmarks for large multimodal models and open-ended subject-driven generation. These achievements demonstrate that Emu2 can serve as a base model and general-purpose interface for a wide range of multimodal tasks. Code and models are publicly available to facilitate future research.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 84.9 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 66.6 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 57 | 1043 |
| Visual Question Answering | GQA | Accuracy | 65.1 | 963 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 84.9 | 664 |
| Video Question Answering | MSRVTT-QA | Accuracy | 31.4 | 481 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 48.5 | 418 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 87.05 | 345 |
| Video Question Answering | MSVD-QA | Accuracy | 49 | 340 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 90.4 | 335 |