SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

About

Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for various audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising only a few million audio samples. Second, existing CLAP models are restricted to short, fixed-duration inputs, which constrains their use in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies the contrastive loss with additional self-supervised and captioning losses in a single training stage, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
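To make the objectives named in the abstract concrete, the following is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired audio and text embeddings, followed by a weighted sum with two further loss terms. It is not the authors' implementation: the function names, the temperature of 0.07, and the weights `w_ssl` and `w_cap` are assumptions made for illustration, and the abstract does not state how SLAP weights or computes its self-supervised and captioning losses.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of one list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is a matched pair.

    The temperature value is an assumed default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(a)
    # logits[i][j]: cosine similarity of audio i and text j, scaled by temperature
    logits = [
        [sum(x * y for x, y in zip(a[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Audio-to-text direction: the correct text for audio i sits in column i.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: same computation over the columns.
    t2a = -sum(
        log_softmax([logits[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (a2t + t2a)


def total_loss(l_con, l_ssl, l_cap, w_ssl=1.0, w_cap=1.0):
    """Hypothetical single-stage objective: a weighted sum of three losses."""
    return l_con + w_ssl * l_ssl + w_cap * l_cap
```

Because the contrastive term only compares one pooled embedding per clip with one per caption, it says nothing about individual time frames, which is the limitation the abstract gives as motivation for adding the other two objectives.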

Xinhao Mei, Gael Le Lan, Haohe Liu, Zhaoheng Ni, Varun Nagaraja, Yang Liu, Yangyang Shi, Vikas Chandra • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 | 47.5 | 145 |
| Audio Captioning | AudioCaps (test) | CIDEr | 75.1 | 140 |
| Audio Classification | ESC-50 (test) | Accuracy | 98.2 | 84 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 36.8 | 78 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 63.4 | 62 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 27.2 | 62 |
| Audio Classification | GTZAN | Accuracy | 80.5 | 54 |
| Audio Classification | Speech Commands V2 (test) | Accuracy | 98.5 | 35 |
| Audio Captioning | Clotho (test) | METEOR | 18.1 | 21 |
| Audio Tagging | AudioSet (test) | mAP | 47.8 | 14 |
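
The retrieval rows above report Recall@1 (R@1), the percentage of queries whose correct match is ranked first. The sketch below shows one common way to compute Recall@K from a query-by-candidate similarity matrix. It assumes a simplified one-to-one setup in which the correct candidate for query `i` is candidate `i`; the function name and the toy scores are illustrative and are not taken from the paper or from any benchmark's evaluation script.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose true match is among the top-k candidates.

    similarity[i][j] is the score of query i against candidate j.
    Assumption: the ground-truth candidate for query i has index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy example: three text queries scored against three audio clips.
scores = [
    [0.9, 0.1, 0.0],  # query 0 ranks its own clip first
    [0.2, 0.8, 0.1],  # query 1 ranks its own clip first
    [0.7, 0.2, 0.3],  # query 2 ranks clip 0 above its own clip
]
r1 = recall_at_k(scores, k=1)
r2 = recall_at_k(scores, k=2)
```

Benchmarks that pair each clip with several reference captions need a set of correct indices per query rather than a single index, so this one-to-one version should be adapted before comparing against published numbers.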
