AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model

About

We present Any-Modality Augmented Language Model (AnyMAL), a unified model that reasons over diverse input modality signals (i.e., text, image, video, audio, and IMU motion-sensor data) and generates textual responses. AnyMAL inherits the powerful text-based reasoning abilities of state-of-the-art LLMs, including LLaMA-2 (70B), and converts modality-specific signals into the joint textual space through a pre-trained aligner module. To further strengthen the multimodal LLM's capabilities, we fine-tune the model on a manually collected multimodal instruction set covering diverse topics and tasks beyond simple QA. We conduct a comprehensive empirical analysis comprising both human and automatic evaluations, and demonstrate state-of-the-art performance on various multimodal tasks.
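
To make the aligner idea concrete, below is a minimal PyTorch sketch of how a pre-trained aligner module can map frozen modality-encoder features into an LLM's token-embedding space. The class name `ModalityAligner`, the dimensions, and the resampler-style attention pooling are illustrative assumptions for this sketch, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ModalityAligner(nn.Module):
    """Hypothetical sketch: projects frozen modality-encoder features
    into the LLM's token-embedding space as a fixed number of soft tokens.
    Dimensions and layer choices are illustrative, not AnyMAL's exact config."""

    def __init__(self, enc_dim: int = 1024, llm_dim: int = 8192, n_tokens: int = 32):
        super().__init__()
        # Learnable queries that pool a variable-length feature sequence
        # into a fixed number of "soft tokens" for the LLM.
        self.queries = nn.Parameter(torch.randn(n_tokens, enc_dim))
        self.attn = nn.MultiheadAttention(enc_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq_len, enc_dim) from a frozen encoder (e.g. a CLIP ViT).
        q = self.queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        pooled, _ = self.attn(q, feats, feats)   # (batch, n_tokens, enc_dim)
        return self.proj(pooled)                 # (batch, n_tokens, llm_dim)

# Usage: the resulting soft tokens would be prepended to the text-prompt
# embeddings so the LLM attends to them like ordinary tokens.
aligner = ModalityAligner()
image_feats = torch.randn(2, 257, 1024)   # e.g. ViT patch features (assumed shape)
soft_tokens = aligner(image_feats)        # (2, 32, 8192)
```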

Seungwhan Moon, Andrea Madotto, Zhaojiang Lin, Tushar Nagarajan, Matt Smith, Shashank Jain, Chun-Fu Yeh, Prakash Murugesan, Peyman Heidari, Yue Liu, Kavya Srinet, Babak Damavandi, Anuj Kumar • 2023

Related benchmarks

Task                      | Dataset            | Metric           | Result | Rank
Visual Question Answering | VQA v2             | Accuracy         | 67.8   | 1165
Visual Question Answering | TextVQA            | Accuracy         | 35.4   | 1117
Visual Question Answering | VizWiz             | Accuracy         | 41.3   | 1043
Visual Question Answering | VQA v2 (test-dev)  | Overall Accuracy | 64.2   | 664
Visual Question Answering | OKVQA              | Top-1 Accuracy   | 64.2   | 283
Visual Question Answering | OK-VQA             | Accuracy         | 46.1   | 224
Visual Question Answering | ScienceQA          | Accuracy         | 70.8   | 210
Visual Question Answering | VQAv2              | Accuracy         | 59.6   | 177
Audio Captioning          | AudioCaps (test)   | CIDEr            | 77.8   | 140
Image Captioning          | MS-COCO (test)     | CIDEr            | 95.9   | 117

(Showing 10 of 19 benchmark results.)
