Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification

About

Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning when pursuing state-of-the-art results on AudioSet. A key reason is that global pooling creates an information bottleneck, causing linear probes to misrepresent embedding quality: the $\texttt{cls}$-token discards crucial token-level information about dispersed, localized events in audio. This weakness is rooted in the mismatch between the pretraining objective (global) and the downstream task (localized). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we investigate the global pooling bottleneck. We introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.

Lukas Rauch, René Heinrich, Houtan Ghaffari, Lukas Miklautz, Ilyass Moummad, Bernhard Sick, Christoph Scholz • 2025
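
The contrast the abstract draws between global pooling and class-wise prototype pooling over patch tokens can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the use of mean pooling as a stand-in for the $\texttt{cls}$-token, cosine similarity, and max aggregation are all assumptions made for illustration. The binarization step named in the method is not described in the abstract, so it is left out rather than guessed.

```python
import numpy as np

def global_pool_logits(patch_tokens, W, b):
    """Linear probe on one global embedding (mean pooling as an assumed
    stand-in for the cls-token). patch_tokens: (N, D) -> logits: (C,)."""
    pooled = patch_tokens.mean(axis=0)  # all patches collapse into one vector
    return pooled @ W + b

def prototype_probe_logits(patch_tokens, prototypes, bias):
    """Hypothetical prototype pooling probe (illustrative only).

    patch_tokens: (N, D) frozen encoder outputs, one per spectrogram patch
    prototypes:   (C, J, D) J learnable prototypes for each of C classes
    bias:         (C,)
    Returns (C,) logits.
    """
    # Cosine similarity between every patch and every prototype
    t = patch_tokens / np.linalg.norm(patch_tokens, axis=-1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=-1, keepdims=True)
    sim = np.einsum("nd,cjd->cjn", t, p)  # (C, J, N)
    # Max over patches: a prototype responds if any single patch matches it
    per_prototype = sim.max(axis=-1)      # (C, J)
    # Max over a class's prototypes (assumed aggregation rule)
    return per_prototype.max(axis=-1) + bias

rng = np.random.default_rng(0)
N, D, C, J = 8, 4, 3, 2
patch_tokens = rng.normal(size=(N, D))
prototypes = rng.normal(size=(C, J, D))

# Plant a localized "event": patch 5 points along class 0's first prototype
patch_tokens[5] = 3.0 * prototypes[0, 0]

logits = prototype_probe_logits(patch_tokens, prototypes, np.zeros(C))
baseline = global_pool_logits(patch_tokens, rng.normal(size=(D, C)), np.zeros(C))
```

In this toy setup the single matching patch drives the class 0 logit to the maximum cosine similarity no matter what the other seven patches contain, whereas the globally pooled vector averages that patch together with the rest before the classifier sees it.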

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy 97.17 | 441 |
| Multi-label bioacoustic classification | BirdSet POW | cmAP 31.98 | 57 |
| Multi-label bioacoustic classification | BirdSet PER | cmAP 15.48 | 57 |
| Multi-label bioacoustic classification | BirdSet HSN | cmAP 0.3463 | 57 |
| Multi-label Bioacoustics | NES | mAP 26.36 | 54 |
| Multi-label Bioacoustics | SNE | mAP 21.38 | 54 |
| Multi-label Bioacoustics | UHH | mAP 17.27 | 54 |
| Multi-label Bioacoustics | nbp | mAP 42.84 | 54 |
| Speech Classification | KS2 | Accuracy 98.4 | 41 |
| Audio Classification | AudioSet 20k (train test) | mAP 31.67 | 19 |
Showing 10 of 14 rows
