FLAM: Frame-Wise Language-Audio Modeling
About
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient, calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalance, during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization while maintaining strong performance in global retrieval and downstream tasks.
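The core idea above, scoring each audio frame against a text embedding and correcting the logits for label imbalance before a per-frame binary loss, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the fixed temperature, and the use of a per-event log-prior as the adjustment term are all assumptions.

```python
import numpy as np

def frame_wise_logit_adjusted_loss(audio_frames, text_embeds, targets,
                                   log_prior, temp=0.07):
    """Sketch of a frame-wise contrastive objective with logit adjustment.

    audio_frames: (B, T, D) per-frame audio embeddings
    text_embeds:  (E, D)    event/caption text embeddings
    targets:      (B, T, E) binary frame-level labels in {0, 1}
    log_prior:    (E,)      log frequency of each event; subtracting it
                            counteracts label imbalance (hypothetical choice)
    """
    # Cosine similarity: L2-normalize both modalities.
    a = audio_frames / np.linalg.norm(audio_frames, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)

    # (B, T, E) frame-vs-event similarity logits, temperature-scaled.
    logits = np.einsum('btd,ed->bte', a, t) / temp

    # Logit adjustment: frequent events need a higher raw score to fire.
    adjusted = logits - log_prior

    # Numerically stable per-frame sigmoid binary cross-entropy.
    loss = (np.maximum(adjusted, 0) - adjusted * targets
            + np.log1p(np.exp(-np.abs(adjusted))))
    return loss.mean()
```

Because every frame-event pair is scored independently with a binary loss, memory grows linearly in the number of frames and events rather than requiring a full pairwise contrastive matrix over the batch.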
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 86.9 | 374 |
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 | 32.1 | 152 |
| Audio Classification | Urbansound8K | Accuracy | 75.6 | 126 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 16.7 | 85 |
| Audio Classification | VGG-Sound | -- | -- | 83 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 43.3 | 69 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 13.8 | 69 |
| Sound Event Detection | AudioSet Strongly-labeled (test) | -- | -- | 18 |
| Sound Event Detection | AudioSet Strong (407 classes) | PSDS1 | 0.35 | 12 |
| Sound Event Detection | UrbanSED (test) | PSDS1 | 0.295 | 6 |