SLAP: Scalable Language-Audio Pretraining with Variable-Duration Audio and Multi-Objective Training
About
Contrastive language-audio pretraining (CLAP) has achieved notable success in learning semantically rich audio representations and is widely adopted for audio-related tasks. However, current CLAP models face three key limitations. First, they are typically trained on relatively small datasets, often comprising only a few million audio samples. Second, they are restricted to short, fixed-duration inputs, which constrains their use in real-world scenarios with variable-duration audio. Third, the standard contrastive training objective operates on global representations, which may hinder the learning of dense, fine-grained audio features. To address these challenges, we introduce Scalable Language-Audio Pretraining (SLAP), which scales language-audio pretraining to 109 million audio-text pairs with variable audio durations and incorporates multiple training objectives. SLAP unifies the contrastive loss with additional self-supervised and captioning losses in a single training stage, facilitating the learning of richer dense audio representations. SLAP achieves new state-of-the-art performance on audio-text retrieval and zero-shot audio classification, demonstrating its effectiveness across diverse benchmarks.
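To make the "global" contrastive objective and the idea of a single-stage multi-objective loss concrete, the following is a minimal sketch in plain Python. It is not the SLAP implementation: the symmetric InfoNCE form is the standard CLAP-style loss, while the temperature value, the function names, and the weights `w_ssl` and `w_cap` are illustrative assumptions that do not come from the abstract.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Symmetric InfoNCE loss over a square similarity matrix.

    sim[i][j] is the similarity between the global embedding of audio i
    and the global embedding of text j; matched pairs lie on the diagonal.
    The temperature of 0.07 is an assumed default, not a value from the paper.
    """
    n = len(sim)
    logits = [[s / temperature for s in row] for row in sim]
    # Audio-to-text direction: each row is scored against all texts.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-audio direction: each column is scored against all audios.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (a2t + t2a)


def combined_loss(contrastive, self_supervised, captioning, w_ssl=1.0, w_cap=1.0):
    """Weighted sum of the three objectives (weights are hypothetical)."""
    return contrastive + w_ssl * self_supervised + w_cap * captioning
```

Because the contrastive term only sees one pooled vector per clip, it gives no direct training signal to individual time frames; the self-supervised and captioning terms are the parts of the sum that can supervise dense features.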
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 | 47.5 | 145 |
| Audio Captioning | AudioCaps (test) | CIDEr | 75.1 | 140 |
| Audio Classification | ESC-50 (test) | Accuracy | 98.2 | 84 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 36.8 | 78 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 63.4 | 62 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 27.2 | 62 |
| Audio Classification | GTZAN | Accuracy | 80.5 | 54 |
| Audio Classification | Speech Commands V2 (test) | Accuracy | 98.5 | 35 |
| Audio Captioning | Clotho (test) | METEOR | 18.1 | 21 |
| Audio Tagging | AudioSet (test) | mAP | 47.8 | 14 |
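
For readers unfamiliar with the retrieval metric in the table, here is a small sketch of how Recall@K can be computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query; benchmark protocols with several captions per clip need a mapping from queries to ground-truth items instead. The function name `recall_at_k` is illustrative.

```python
def recall_at_k(sim, k=1):
    """Percentage of queries whose correct candidate ranks in the top k.

    sim[q][c] is the similarity between query q and candidate c.
    Assumption: the correct candidate for query q is candidate q.
    """
    hits = 0
    for q, row in enumerate(sim):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if q in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(sim)


# Toy example: queries 0 and 1 retrieve their own item first, query 2 does not.
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.1],
    [0.7, 0.2, 0.3],
]
r1 = recall_at_k(toy_sim, k=1)
r3 = recall_at_k(toy_sim, k=3)
```

In the toy matrix, two of three queries succeed at K=1, and all succeed once K equals the number of candidates.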