
FLAM: Frame-Wise Language-Audio Modeling

About

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling capability needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalance during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
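The abstract describes the frame-wise objective only at a high level. Below is a minimal, hypothetical sketch of what a sigmoid-based frame-wise audio-text loss with logit adjustment could look like in PyTorch; the function name, the `prior` and `tau` parameters, and all tensor shapes are assumptions for illustration, not FLAM's actual implementation (which also includes memory-efficiency and calibration details not shown here).

```python
import torch
import torch.nn.functional as F

def frame_wise_logit_adjusted_loss(frame_emb, text_emb, targets, prior, tau=0.07):
    """Sketch of a frame-wise audio-text objective with logit adjustment.

    frame_emb: (B, T, D) per-frame audio embeddings, L2-normalized
    text_emb:  (B, D)    caption/event text embeddings, L2-normalized
    targets:   (B, B, T) float labels, 1 where text j is active in
               frame t of audio i, else 0
    prior:     (B,)      estimated marginal activity rate per event,
               used to offset logits against label imbalance
    tau:       temperature scaling the cosine similarities
    """
    # Similarity of every frame of every audio to every text: (B, B, T)
    logits = torch.einsum('itd,jd->ijt', frame_emb, text_emb) / tau
    # Logit adjustment: subtract each event's log-prior so frequent or
    # co-occurring events do not dominate the per-frame sigmoid scores
    logits = logits - torch.log(prior).view(1, -1, 1)
    # Per-frame binary cross-entropy over all (audio, text, frame) triples
    return F.binary_cross_entropy_with_logits(logits, targets)
```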

Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon • 2025

Related benchmarks

Task                  | Dataset                       | Result      | Rank
Sound Event Detection | AudioSet Strong (407 classes) | PSDS1A 0.35 | 12
Sound Event Detection | UrbanSED (10 classes)         | PSDS1A 0.3  | 3
Sound Event Detection | ASFX-SED                      | AUROC 81    | 3
Sound Event Detection | DESED (10 classes)            | PSDS1A 0.09 | 3
