Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time

About

Leveraging Large Language Models' remarkable proficiency in text-based tasks, recent works on Multi-modal LLMs (MLLMs) extend them to other modalities like vision and audio. However, progress in these directions has mostly focused on tasks that require only a coarse-grained understanding of audio-visual semantics. We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio, both spatially and temporally. With a new modality alignment module based on optimal transport and a cross-attention module that enforces audio-visual consistency, Meerkat can tackle challenging tasks such as audio-referred image grounding, image-guided audio temporal localization, and audio-visual fact-checking. Moreover, we carefully curate a large dataset, AVFIT, that comprises 3M instruction-tuning samples collected from open-source datasets, and introduce MeerkatBench, which unifies five challenging audio-visual tasks. We achieve state-of-the-art performance on all these downstream tasks, with a relative improvement of up to 37.12%.
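
The abstract names optimal transport as the basis of the modality alignment module but gives no further detail. As a rough illustration of the general idea only, the sketch below computes an entropy-regularized transport plan between two small sets of token embeddings using Sinkhorn iterations. The cosine cost, the uniform marginals, the `eps` value, and all function names (`cosine_cost`, `sinkhorn`, `ot_alignment_cost`) are assumptions made for this example and are not taken from Meerkat.

```python
import math

def cosine_cost(a, b):
    """Cost between two vectors: 1 - cosine similarity (assumed cost choice)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def sinkhorn(cost, eps=0.1, n_iters=200):
    """Entropy-regularized OT with uniform marginals; returns the transport plan."""
    n, m = len(cost), len(cost[0])
    row_mass = [1.0 / n] * n   # uniform mass over the first token set
    col_mass = [1.0 / m] * m   # uniform mass over the second token set
    kernel = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]

def ot_alignment_cost(audio_tokens, image_tokens, eps=0.1):
    """Total transport cost between two token sets, plus the plan itself."""
    cost = [[cosine_cost(a, p) for p in image_tokens] for a in audio_tokens]
    plan = sinkhorn(cost, eps)
    total = sum(plan[i][j] * cost[i][j]
                for i in range(len(cost)) for j in range(len(cost[0])))
    return total, plan

# Toy embeddings (hypothetical): two audio tokens, three image patch tokens.
audio = [[1.0, 0.0], [0.0, 1.0]]
image = [[0.9, 0.1], [0.1, 0.9], [0.7, 0.7]]
alignment_cost, plan = ot_alignment_cost(audio, image)
```

In a setup like this, a lower transport cost indicates that the two token sets can be matched cheaply, so a quantity of this kind could serve as an alignment loss. Whether Meerkat uses it in this way is not stated here.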

Sanjoy Chowdhury, Sayan Nag, Subhrajyoti Dasgupta, Jun Chen, Mohamed Elhoseiny, Ruohan Gao, Dinesh Manocha • 2024

Related benchmarks

Task	Dataset	Metric	Result
Sound Source Localization	Flickr SoundNet (test)	cIoU	88.35	49
Audio referred image grounding	VGG-SS (test)	cIoU	48.51	10
Audio referred image grounding	PascalSound (test)	cIoU	65.23	10
Audio referred image grounding	AVSBench (test)	cIoU	79.82	10
Audio-Visual Question Answering	AVQA (val)	Existence Accuracy	88.24	9
Audio-Visual Question Answering	MUSIC-AVQA balanced (test)	Existential Score	83.62	8
Audio-Visual Captioning	VALOR 32K (val)	BLEU@4	16.88	7
Audio-Visual Fact-checking	AVFact	Type 1 F1-score	85	7
Image Guided Audio Temporal Localization	LLP (test)	F1 Score	54.96	5
Image Guided Audio Temporal Localization	AudioSet Strong (test)	F1 Score	56.85	5

Showing 10 of 11 rows
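
The grounding rows in the table report cIoU, which this page does not define. In the sound-localization literature the term usually refers to a consensus intersection-over-union score, and the exact evaluation protocol used for these results is not given here. As a minimal sketch of the underlying overlap measure only, plain IoU between two axis-aligned boxes can be computed as follows. The `(x_min, y_min, x_max, y_max)` box format and the function name `box_iou` are assumptions for this example.

```python
def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the intersection rectangle.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
overlap = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
```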
