
Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

About

Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, the progress in these directions has been mostly focused on tasks that only require a coarse-grained understanding of the audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset AVFIT that comprises 3M instruction tuning samples collected from open-source datasets, and introduce MeerkatBench that unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha • 2024
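
The abstract mentions a modality alignment module based on optimal transport, but this page gives no details of it. As a rough illustration of the general idea only, and not the Meerkat implementation, the sketch below computes an entropy-regularised transport plan between a few image-patch embeddings and audio-segment embeddings using Sinkhorn iterations. The function names (`cosine_cost`, `sinkhorn`), the toy embeddings, the cosine cost, the uniform marginals, and the value of `eps` are all assumptions made for this example.

```python
import math


def cosine_cost(xs, ys):
    """Cost matrix with C[i][j] = 1 - cosine similarity of xs[i] and ys[j]."""
    def norm(v):
        return math.sqrt(sum(a * a for a in v))

    return [
        [1.0 - sum(a * b for a, b in zip(x, y)) / (norm(x) * norm(y)) for y in ys]
        for x in xs
    ]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between two uniform distributions."""
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform mass over image patches (assumption)
    b = [1.0 / m] * m  # uniform mass over audio segments (assumption)
    # Gibbs kernel: low-cost pairs receive large kernel values.
    K = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale so the plan's row sums match a and column sums match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Hypothetical 2-D embeddings: three image patches and two audio segments.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[1.0, 0.1], [0.1, 1.0]]

plan = sinkhorn(cosine_cost(patches, audio))
for row in plan:
    print([round(p, 4) for p in row])
```

Each entry `plan[i][j]` is the share of mass moved from patch `i` to audio segment `j`, so large entries mark pairs the plan treats as aligned. In this toy setup the first patch sends most of its mass to the first audio segment, while the third patch, which is equally similar to both segments, splits its mass evenly.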

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sound Source Localization | Flickr SoundNet (test) | cIoU | 88.35 | 28 |
| Audio referred image grounding | VGG-SS (test) | cIoU | 48.51 | 10 |
| Audio referred image grounding | PascalSound (test) | cIoU | 65.23 | 10 |
| Audio referred image grounding | AVSBench (test) | cIoU | 79.82 | 10 |
| Audio-Visual Question Answering | AVQA (val) | Existence Accuracy | 88.24 | 9 |
| Audio-Visual Question Answering | MUSIC-AVQA balanced (test) | Existential Score | 83.62 | 8 |
| Audio-Visual Captioning | VALOR 32K (val) | BLEU@4 | 16.88 | 7 |
| Audio-Visual Fact-checking | AVFact | Type 1 F1-score | 85 | 7 |
| Image Guided Audio Temporal Localization | LLP (test) | F1 Score | 54.96 | 5 |
| Image Guided Audio Temporal Localization | AudioSet Strong (test) | F1 Score | 56.85 | 5 |
Showing 10 of 11 rows
