
FLAM: Frame-Wise Language-Audio Modeling

About

Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporally aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient and calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalance during training. To enable frame-wise supervision, we leverage a large-scale dataset built from diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization capability while maintaining strong performance in global retrieval and downstream tasks.
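The frame-wise objective with logit adjustment described above can be sketched roughly as follows. This is a minimal illustration, not FLAM's exact formulation: the function name, the per-frame sigmoid/binary-cross-entropy form, and the single scalar event prior used for the adjustment are all simplifying assumptions made here for clarity.

```python
import numpy as np

def framewise_logit_adjusted_loss(frame_emb, text_emb, labels, prior, tau=0.07):
    """Sketch of a frame-wise objective with logit adjustment (illustrative only).

    frame_emb : (T, D) L2-normalized audio frame embeddings
    text_emb  : (D,)   L2-normalized text (event query) embedding
    labels    : (T,)   binary frame-level activity of the queried event
    prior     : float  assumed empirical frequency of the event (label imbalance)
    tau       : float  temperature scaling the similarities
    """
    # Per-frame similarity between each audio frame and the text query.
    logits = frame_emb @ text_emb / tau
    # Logit adjustment: shift scores by the log-odds of the event's prior,
    # counteracting label imbalance in the training distribution.
    logits = logits + np.log(prior / (1.0 - prior))
    # Per-frame sigmoid gives a calibrated activity probability per frame.
    probs = 1.0 / (1.0 + np.exp(-logits))
    # Binary cross-entropy against the frame-level labels, averaged over frames.
    eps = 1e-9
    bce = -(labels * np.log(probs + eps) + (1.0 - labels) * np.log(1.0 - probs + eps))
    return bce.mean()
```

In this simplified view, the adjustment term moves the decision boundary so that rare events are not systematically suppressed; the actual model applies its objective over batches of audio-text pairs rather than a single query.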

Yusong Wu, Christos Tsirigotis, Ke Chen, Cheng-Zhi Anna Huang, Aaron Courville, Oriol Nieto, Prem Seetharaman, Justin Salamon • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 86.9 | 374 |
| Text-to-Audio Retrieval | AudioCaps (test) | R@1 | 32.1 | 152 |
| Audio Classification | UrbanSound8K | Accuracy | 75.6 | 126 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 16.7 | 85 |
| Audio Classification | VGG-Sound | -- | -- | 83 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 43.3 | 69 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 13.8 | 69 |
| Sound Event Detection | AudioSet Strongly-labeled (test) | -- | -- | 18 |
| Sound Event Detection | AudioSet Strong (407 classes) | PSDS1A | 0.35 | 12 |
| Sound Event Detection | UrbanSED (test) | PSDS1 | 0.295 | 6 |

Showing 10 of 14 rows.
