
FineLAP: Taming Heterogeneous Supervision for Fine-grained Language-Audio Pretraining

About

Contrastively pretrained audio-language models (e.g., CLAP) excel at clip-level understanding but struggle with frame-level tasks. Existing extensions fail to exploit the varying granularity of real-world audio-text data, where massive clip-level textual descriptions coexist with limited frame-level annotations. This paper proposes Fine-grained Language-Audio Pretraining (FineLAP), a novel training paradigm that advances both clip- and frame-level alignment in CLAP with heterogeneous data. FineLAP introduces a dual-stream sigmoid loss with a cluster-based sampling strategy to jointly learn from clip- and frame-level supervision. To capture both global semantics and local details, FineLAP uses a decoupled audio projector on top of a self-supervised encoder. To alleviate the scarcity of temporally annotated data, we present FineLAP-100k, a large-scale synthetic SED dataset constructed through a scalable curation pipeline. Extensive experiments demonstrate that FineLAP achieves SOTA performance across multiple audio understanding tasks, including retrieval, classification, sound event detection, and text-to-audio grounding. Ablation studies further show that coarse- and fine-grained alignment are mutually beneficial, providing insights for building better audio-language models (ALMs).
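The abstract names a dual-stream sigmoid loss but does not spell out its form. As a rough illustration only, the sketch below applies a pairwise sigmoid loss (in the style of SigLIP) to two similarity matrices, one at clip level and one at frame level, and sums them. The function names, the `scale` and `bias` values, the `frame_weight` parameter, and the averaging over all pairs are assumptions made for this sketch, not details taken from the paper, and the cluster-based sampling strategy is not modeled.

```python
import math


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sigmoid_pair_loss(sims, labels, scale=10.0, bias=-10.0):
    """Mean pairwise sigmoid loss over a similarity matrix.

    sims[i][j]   : cosine similarity between audio item i and text j
    labels[i][j] : 1 if the pair matches, 0 otherwise
    scale, bias  : hypothetical fixed values (learnable in practice)
    """
    total, count = 0.0, 0
    for sim_row, label_row in zip(sims, labels):
        for s, y in zip(sim_row, label_row):
            sign = 1.0 if y else -1.0  # pull matches together, push others apart
            total -= log_sigmoid(sign * (scale * s + bias))
            count += 1
    return total / count


def dual_stream_loss(clip_sims, clip_labels, frame_sims, frame_labels,
                     frame_weight=1.0):
    # Clip stream: rows are audio clips, columns are captions.
    # Frame stream: rows are audio frames, columns are event phrases;
    # a label is 1 when the event is active in that frame.
    clip_loss = sigmoid_pair_loss(clip_sims, clip_labels)
    frame_loss = sigmoid_pair_loss(frame_sims, frame_labels)
    return clip_loss + frame_weight * frame_loss
```

Because every audio-text pair is scored independently, the same loss form can be applied to a clip-by-caption matrix and to a frame-by-event matrix without requiring exactly one positive per row, which is one plausible reason a sigmoid objective suits mixed clip- and frame-level supervision.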

Xiquan Li, Xuenan Xu, Ziyang Ma, Wenxi Chen, Haolin He, Qiuqiang Kong, Xie Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy 93.9 | 366 |
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 45.7 | 152 |
| Audio Classification | Urbansound8K | Accuracy 84.9 | 126 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 26.6 | 85 |
| Audio Classification | VGG-Sound | -- | 83 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 62.5 | 69 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 18.9 | 69 |
| Sound Event Detection | AudioSet Strongly-labeled (test) | -- | 18 |
| Sound Event Detection | UrbanSED (test) | PSDS1 0.446 | 6 |
| Sound Event Detection | DESED (evaluation) | PSDS1 34.4 | 6 |
Showing 10 of 11 rows
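For readers unfamiliar with the retrieval metric listed above, Recall@1 (R@1) is the share of queries whose top-ranked candidate is a correct match. The following is a minimal sketch of Recall@k under that general definition; the function name and the toy similarity values are hypothetical, and the exact evaluation protocol used for each benchmark is not stated on this page.

```python
def recall_at_k(sims, relevant, k=1):
    """Percentage of queries with at least one correct item in the top k.

    sims[q][c]  : similarity between query q and candidate c
    relevant[q] : set of candidate indices that are correct for query q
    """
    hits = 0
    for scores, rel in zip(sims, relevant):
        # Rank candidate indices by descending similarity and keep the top k.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        hits += any(c in rel for c in ranked[:k])
    return 100.0 * hits / len(sims)


# Toy example: 3 text queries scored against 3 audio candidates.
toy_sims = [
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],
    [0.1, 0.7, 0.4],
]
toy_relevant = [{0}, {1}, {1}]
toy_r1 = recall_at_k(toy_sims, toy_relevant, k=1)
```

Using a set of correct indices per query lets the same function cover both directions: text-to-audio, where a caption typically has one matching clip, and audio-to-text, where a clip may have several reference captions.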
