Unmute the Patch Tokens: Rethinking Probing in Multi-Label Audio Classification
About
Although probing frozen models has become a standard evaluation paradigm, self-supervised learning in audio defaults to fine-tuning when pursuing state-of-the-art results on AudioSet. A key reason is that global pooling creates an information bottleneck, causing linear probes to misrepresent embedding quality: the $\texttt{cls}$-token discards crucial token-level information about dispersed, localized events in audio. This weakness is rooted in the mismatch between the pretraining objective (global) and the downstream task (localized). Across a comprehensive benchmark of 13 datasets and 6 spectrogram-based encoders, we investigate the global pooling bottleneck. We introduce binarized prototypical probes: a lightweight and simple pooling method that learns prototypes to perform class-wise information aggregation. Despite its simplicity, our method notably outperforms linear and attentive probing. Our work establishes probing as a competitive and efficient paradigm for evaluating audio SSL models, challenging the reliance on costly fine-tuning.
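To make the pooling contrast concrete, the following is a minimal NumPy sketch, not the authors' implementation. It compares a linear probe on a single global embedding with a generic prototype-based probe that scores every patch token against per-class prototypes and max-pools over tokens. The function names, the cosine similarity, the max pooling, and the per-class weighting are all assumptions made for illustration; the binarization step named in the abstract is omitted because the abstract does not specify it.

```python
import numpy as np


def linear_probe_logits(global_embedding, weight, bias):
    """Linear probe on one global embedding: (D,) -> (C,)."""
    return weight @ global_embedding + bias


def prototype_probe_logits(patch_tokens, prototypes, class_weights):
    """Hypothetical prototype probe (illustrative, not the paper's method).

    patch_tokens:  (T, D) frozen encoder outputs, one row per patch
    prototypes:    (C, J, D) J learnable prototypes for each of C classes
    class_weights: (C, J) learnable weights combining a class's prototypes
    returns:       (C,) one logit per class
    """
    eps = 1e-8
    tokens = patch_tokens / (np.linalg.norm(patch_tokens, axis=-1, keepdims=True) + eps)
    protos = prototypes / (np.linalg.norm(prototypes, axis=-1, keepdims=True) + eps)
    # Cosine similarity of every prototype with every patch token: (C, J, T)
    similarity = np.einsum("td,cjd->cjt", tokens, protos)
    # Max over tokens keeps the best-matching patch for each prototype,
    # so a short, localized event is not averaged away.
    pooled = similarity.max(axis=-1)  # (C, J)
    return (pooled * class_weights).sum(axis=-1)  # (C,)


# Toy input: 4 patch tokens, only one of which contains the "event" direction.
patch_tokens = np.array([
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],  # localized event
    [0.0, 0.0, 1.0],
])
prototypes = np.array([
    [[1.0, 0.0, 0.0]],  # class 0 prototype matches the event
    [[0.0, 1.0, 0.0]],  # class 1 prototype matches nothing in the clip
])
class_weights = np.ones((2, 1))

logits = prototype_probe_logits(patch_tokens, prototypes, class_weights)
mean_pooled = patch_tokens.mean(axis=0)
```

In this toy clip the class 0 logit is driven by the single matching patch, whereas `mean_pooled` is dominated by the three background tokens. That is the kind of dilution the abstract attributes to global pooling.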
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 97.17 | 441 |
| Multi-label bioacoustic classification | BirdSet POW | cmAP | 31.98 | 57 |
| Multi-label bioacoustic classification | BirdSet PER | cmAP | 15.48 | 57 |
| Multi-label bioacoustic classification | BirdSet HSN | cmAP | 0.3463 | 57 |
| Multi-label Bioacoustics | NES | mAP | 26.36 | 54 |
| Multi-label Bioacoustics | SNE | mAP | 21.38 | 54 |
| Multi-label Bioacoustics | UHH | mAP | 17.27 | 54 |
| Multi-label Bioacoustics | nbp | mAP | 42.84 | 54 |
| Speech Classification | KS2 | Accuracy | 98.4 | 41 |
| Audio Classification | AudioSet 20k (train test) | mAP | 31.67 | 19 |