AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model
About
We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, and IMU motion sensor data) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including LLaMA-2 (70B), and converts modality-specific signals into the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model with a manually collected multimodal instruction set covering diverse topics and tasks beyond simple QA. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
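The aligner described above maps features from a frozen modality encoder into the LLM's token embedding space, so that the projected tokens can be prepended to the text prompt as a soft prefix. The snippet below is a minimal sketch of this idea, assuming a Perceiver-style attention resampler followed by a linear projection; the class name, dimensions, and token count are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Sketch: project frozen modality-encoder features into the LLM embedding space."""

    def __init__(self, encoder_dim: int, llm_dim: int, num_tokens: int = 32):
        super().__init__()
        # Learnable query tokens attend over the encoder features (Perceiver-style),
        # then a linear layer maps them to the LLM's hidden width.
        self.queries = nn.Parameter(torch.randn(num_tokens, encoder_dim) * 0.02)
        self.resampler = nn.MultiheadAttention(encoder_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, encoder_features: torch.Tensor) -> torch.Tensor:
        # encoder_features: (batch, seq_len, encoder_dim) from a frozen modality encoder
        batch = encoder_features.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.resampler(q, encoder_features, encoder_features)
        # Returns (batch, num_tokens, llm_dim): a soft prefix prepended to the text embeddings.
        return self.proj(pooled)


if __name__ == "__main__":
    # Hypothetical sizes: a CLIP-like image encoder (1024-d) feeding a LLaMA-2-scale LLM (4096-d).
    aligner = ModalityAligner(encoder_dim=1024, llm_dim=4096)
    image_features = torch.randn(2, 257, 1024)  # dummy patch features
    prefix = aligner(image_features)
    print(prefix.shape)  # torch.Size([2, 32, 4096])
```

During training of such an aligner only the projection and query parameters would be updated, with both the modality encoder and the LLM kept frozen; instruction fine-tuning then adapts the combined model as described above.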
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 67.8 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 35.4 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 41.3 | 1043 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 64.2 | 664 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 64.2 | 283 |
| Visual Question Answering | OK-VQA | Accuracy | 46.1 | 224 |
| Visual Question Answering | ScienceQA | Accuracy | 70.8 | 210 |
| Visual Question Answering | VQAv2 | Accuracy | 59.6 | 177 |
| Audio Captioning | AudioCaps (test) | CIDEr | 77.8 | 140 |
| Image Captioning | MS-COCO (test) | CIDEr | 95.9 | 117 |