WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM

About

While embeddings from multimodal large language models (LLMs) excel as general-purpose representations, their application to dynamic modalities like audio and video remains underexplored. We introduce WAVE (unified & versatile audio-visual embeddings), the first LLM-based embedding that creates a unified representation space for text, audio, and video modalities. WAVE employs a novel hierarchical feature fusion strategy and a joint multi-modal, multi-task training approach to enable two key capabilities: any-to-any cross-modal retrieval and the generation of prompt-aware embeddings tailored to user instructions. Experimentally, WAVE sets a new state-of-the-art on the MMEB-v2 video benchmark and achieves superior results in audio and video-to-audio retrieval. Its prompt-aware nature also yields remarkable performance in multimodal question answering, significantly outperforming existing embedding models. Ablation studies validate our joint training strategy, demonstrating improved performance across all modalities. With a newly introduced benchmark for versatile audio-visual learning, WAVE opens up broad possibilities for cross-modal, any-to-any applications. Our code and checkpoints are released at https://github.com/TCL606/WAVE.

Changli Tang, Qinfan Xiao, Ke Mei, Tianyi Wang, Fengyun Rao, Chao Zhang • 2025

Related benchmarks

Task	Dataset	Result
Video-to-Text Retrieval	VATEX	--	84
Audio-to-Text Retrieval	AudioCaps (test)	R@1: 44.2	69
Audio-to-Text Retrieval	Clotho	--	49
Video Understanding	MMEB Video v2	Classification Score (CLS): 57.8	17
Audio-to-Video Retrieval	VGGSound (test)	Recall@1: 25	13
Video Question Answering	MMEB Video QA v2 (test)	Average Score: 72.5	6
Multi-modal Cross-modal Retrieval	OmniRetriever Bench	Recall@1 (T->V): 41.27	4
Video Retrieval	LoVR	Text-to-Clip Score: 62.9	3
Audio QA	MMAU mini (test)	Accuracy: 0.766	2
Audio QA	MMAR (test)	Accuracy: 68.1	2

Showing 10 of 14 rows
