FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining
About
Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
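The abstract names a dual-stream sigmoid loss but does not spell out its form. As a rough illustration only, the sketch below assumes a SigLIP-style pairwise sigmoid loss applied twice: once between clip embeddings and caption embeddings (diagonal pairs are positives), and once between frame embeddings and event-phrase embeddings (positives given by frame-level annotations). The function names, the `scale` and `bias` defaults, and the `frame_weight` term are all hypothetical and not taken from the paper; the cluster-based sampling strategy is not modeled here.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid_pair_loss(audio_embs, text_embs, labels, scale=10.0, bias=-5.0):
    """Mean pairwise sigmoid loss over every (audio, text) pair.

    labels[i][j] is 1 if audio item i matches text item j, else 0.
    scale and bias are assumed hyperparameters (learnable in practice).
    """
    total, count = 0.0, 0
    for i, a in enumerate(audio_embs):
        for j, t in enumerate(text_embs):
            sign = 1.0 if labels[i][j] else -1.0
            total -= log_sigmoid(sign * (scale * dot(a, t) + bias))
            count += 1
    return total / count


def dual_stream_loss(clip_embs, caption_embs, frame_embs, event_embs,
                     frame_labels, frame_weight=1.0):
    """Assumed combination: clip-level loss plus weighted frame-level loss."""
    n = len(clip_embs)
    # Clip stream: the i-th caption describes the i-th clip.
    clip_labels = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    clip_loss = sigmoid_pair_loss(clip_embs, caption_embs, clip_labels)
    # Frame stream: frame_labels[f][e] is 1 if event e is active in frame f.
    frame_loss = sigmoid_pair_loss(frame_embs, event_embs, frame_labels)
    return clip_loss + frame_weight * frame_loss


if __name__ == "__main__":
    clips = [[1.0, 0.0], [0.0, 1.0]]
    captions = [[1.0, 0.0], [0.0, 1.0]]
    frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    events = [[1.0, 0.0], [0.0, 1.0]]
    frame_labels = [[1, 0], [1, 0], [0, 1]]
    print(dual_stream_loss(clips, captions, frames, events, frame_labels))
```

Because each pair is scored independently with a sigmoid rather than normalized with a softmax over the batch, the same loss form can handle frames where several events, or none, are active, which is one plausible reason to prefer it for mixed clip- and frame-level supervision.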
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 93.9 | 366 |
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 | 45.7 | 152 |
| Audio Classification | Urbansound8K | Accuracy | 84.9 | 126 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 26.6 | 85 |
| Audio Classification | VGG-Sound | -- | -- | 83 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 62.5 | 69 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 18.9 | 69 |
| Sound Event Detection | AudioSet Strongly-labeled (test) | -- | -- | 18 |
| Sound Event Detection | UrbanSED (test) | PSDS1 | 0.446 | 6 |
| Sound Event Detection | DESED (evaluation) | PSDS1 | 34.4 | 6 |