FLAM: Frame-Wise Language-Audio Modeling
About
Recent multi-modal audio-language models (ALMs) excel at text-audio retrieval but struggle with frame-wise audio understanding. Prior works use temporal-aware labels or unsupervised training to improve frame-wise capabilities, but they still lack the fine-grained labeling needed to pinpoint when an event occurs. While traditional sound event detection models can precisely localize events, they are limited to pre-defined categories, making them ineffective for real-world scenarios with out-of-distribution events. In this work, we introduce FLAM, an open-vocabulary contrastive audio-language model capable of localizing specific sound events. FLAM employs a memory-efficient, calibrated frame-wise objective with logit adjustment to address spurious correlations, such as event dependencies and label imbalance, during training. To enable frame-wise supervision, we leverage a large-scale dataset with diverse audio events, LLM-generated captions, and simulation. Experimental results and case studies demonstrate that FLAM significantly improves open-vocabulary localization while maintaining strong performance in global retrieval and downstream tasks.
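The core idea above, scoring each audio frame against a text embedding and correcting the logits for label imbalance before a per-frame binary loss, can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the fixed temperature, and the use of a per-event log-prior as the adjustment term are all assumptions.

```python
import numpy as np

def frame_wise_logit_adjusted_loss(audio_frames, text_embeds, targets,
                                   log_prior, temp=0.07):
    """Sketch of a frame-wise contrastive objective with logit adjustment.

    audio_frames: (B, T, D) per-frame audio embeddings
    text_embeds:  (E, D)    event/caption text embeddings
    targets:      (B, T, E) binary frame-level labels in {0, 1}
    log_prior:    (E,)      log frequency of each event; subtracting it
                            counteracts label imbalance (hypothetical choice)
    """
    # Cosine similarity: L2-normalize both modalities.
    a = audio_frames / np.linalg.norm(audio_frames, axis=-1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)

    # (B, T, E) frame-vs-event similarity logits, temperature-scaled.
    logits = np.einsum('btd,ed->bte', a, t) / temp

    # Logit adjustment: frequent events need a higher raw score to fire.
    adjusted = logits - log_prior

    # Numerically stable per-frame sigmoid binary cross-entropy.
    loss = (np.maximum(adjusted, 0) - adjusted * targets
            + np.log1p(np.exp(-np.abs(adjusted))))
    return loss.mean()
```

Because every frame-event pair is scored independently with a binary loss, memory grows linearly in the number of frames and events rather than requiring a full pairwise contrastive matrix over the batch.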
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 86.9 | 374 |
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 | 32.1 | 152 |
| Audio Classification | Urbansound8K | Accuracy | 75.6 | 126 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 16.7 | 85 |
| Audio Classification | VGG-Sound | -- | -- | 83 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 43.3 | 69 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 13.8 | 69 |
| Sound Event Detection | AudioSet Strongly-labeled (test) | -- | -- | 18 |
| Sound Event Detection | AudioSet Strong (407 classes) | PSDS1 | 0.35 | 12 |
| Sound Event Detection | UrbanSED (test) | PSDS1 | 0.295 | 6 |