Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time
About
Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, progress in these directions has mostly focused on tasks that only require a coarse-grained understanding of audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio, both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset, AVFIT, comprising 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks, with a relative improvement of up to 37.12%.
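The abstract names optimal transport as the basis of the modality alignment module but gives no further detail. As a rough illustration of the general idea only, the sketch below computes an entropy-regularized transport plan between two sets of token features using the standard Sinkhorn iteration. Everything here is an assumption made for illustration: the function names (`cosine_cost`, `sinkhorn_plan`), the cosine cost, the uniform token weights, the regularization strength, and the random features standing in for image and audio tokens. It is not Meerkat's actual module.

```python
import numpy as np

def cosine_cost(x, y):
    """Pairwise cosine distance between rows of x and rows of y."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    y = y / np.linalg.norm(y, axis=1, keepdims=True)
    return 1.0 - x @ y.T

def sinkhorn_plan(cost, eps=0.5, n_iters=500):
    """Entropy-regularized transport plan between two uniform distributions.

    Hypothetical helper: eps and n_iters are illustrative defaults.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)      # uniform mass over the first token set
    b = np.full(m, 1.0 / m)      # uniform mass over the second token set
    K = np.exp(-cost / eps)      # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)          # rescale rows to match a
        v = b / (K.T @ u)        # rescale columns to match b
    return u[:, None] * K * v[None, :]

# Random stand-ins for image and audio token features (assumed shapes).
rng = np.random.default_rng(0)
image_tokens = rng.normal(size=(6, 16))
audio_tokens = rng.normal(size=(4, 16))

cost = cosine_cost(image_tokens, audio_tokens)
plan = sinkhorn_plan(cost)

# Transport cost under the plan; a quantity like this could serve as an alignment loss.
ot_distance = float((plan * cost).sum())
```

Each entry `plan[i, j]` is the share of mass moved from image token `i` to audio token `j`, so rows sum to `1/6` and columns to `1/4` in this toy setup. Lower `ot_distance` means the two token sets are closer under the chosen cost.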
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sound Source Localization | Flickr SoundNet (test) | CIoU | 88.35 | 28 |
| Audio referred image grounding | VGG-SS (test) | cIoU | 48.51 | 10 |
| Audio referred image grounding | PascalSound (test) | cIoU | 65.23 | 10 |
| Audio referred image grounding | AVSBench (test) | cIoU | 79.82 | 10 |
| Audio-Visual Question Answering | AVQA (val) | Existence Accuracy | 88.24 | 9 |
| Audio-Visual Question Answering | MUSIC-AVQA balanced (test) | Existential Score | 83.62 | 8 |
| Audio-Visual Captioning | VALOR 32K (val) | BLEU@4 | 16.88 | 7 |
| Audio-Visual Fact-checking | AVFact | Type 1 F1-score | 85 | 7 |
| Image Guided Audio Temporal Localization | LLP (test) | F1 Score | 54.96 | 5 |
| Image Guided Audio Temporal Localization | AudioSet Strong (test) | F1 Score | 56.85 | 5 |