SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training

About

Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for audio-related tasks. However, current CLAP models face several key limitations. First, they are typically trained on relatively small datasets, often comprising a few million audio samples. Second, existing CLAP models are restricted to short, fixed-duration audio, which limits their use in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies the contrastive loss with additional self-supervised and captioning losses in a single training stage, facilitating the learning of richer dense audio representations. The proposed SLAP model achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification tasks, demonstrating its effectiveness across diverse benchmarks.
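The abstract names a contrastive loss combined with self-supervised and captioning losses but gives no formulas. As a rough illustration only, the sketch below shows the symmetric contrastive objective commonly used in CLAP-style training, plus one possible way to sum several objectives. The function names, the temperature of 0.07, the loss weights, and the weighted-sum form are all assumptions made for illustration and are not taken from the SLAP paper.

```python
import math

def l2_normalize(v):
    # Scale a non-zero vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diag(logits):
    # Mean cross-entropy where the correct class for row i is column i.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    # Symmetric contrastive loss over a batch of paired global embeddings.
    # temperature=0.07 is an assumed value, not one reported for SLAP.
    a = [l2_normalize(v) for v in audio_emb]
    t = [l2_normalize(v) for v in text_emb]
    logits = [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-audio direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))

def total_loss(l_con, l_ssl, l_cap, w_ssl=1.0, w_cap=1.0):
    # Hypothetical weighted sum of the three objectives named in the abstract;
    # the actual combination and weights used by SLAP are not given here.
    return l_con + w_ssl * l_ssl + w_cap * l_cap

# Toy batch: correctly paired embeddings should score a lower loss than
# the same embeddings with the pairing swapped.
aligned = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = contrastive_loss(aligned, aligned)
loss_mismatched = contrastive_loss(aligned, swapped)
```

In this toy batch `loss_matched` is smaller than `loss_mismatched`, which is the behavior the contrastive objective rewards. Because the loss acts only on one global embedding per clip, it says nothing about frame-level features, which is the limitation the abstract's additional objectives are meant to address.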

Xinhao Mei, Gael Le Lan, Haohe Liu, Zhaoheng Ni, Varun Nagaraja, Yang Liu, Yangyang Shi, Vikas Chandra • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Audio Captioning	AudioCaps (test)	CIDEr	75.1	222
Text-to-Audio Retrieval	AudioCaps (test)	Recall@1	47.5	191
Audio Classification	ESC-50 (test)	Accuracy	98.2	111
Audio-to-Text Retrieval	Clotho (test)	R@1	36.8	92
Text-to-Audio Retrieval	Clotho (test)	R@1	27.2	85
Audio-to-Text Retrieval	AudioCaps (test)	R@1	63.4	80
Audio Classification	GTZAN	Accuracy	80.5	65
Audio Classification	Speech Commands V2 (test)	Accuracy	98.5	59
Audio Captioning	Clotho (test)	METEOR	18.1	43
Audio Classification	US8K	Top-1 Accuracy	83.5	30

Showing 10 of 14 rows
