AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers
About
Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
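As a rough illustration of two ideas in the abstract, the NumPy sketch below shows (1) where a temporal modulation lands after strided downsampling and (2) a complex sinusoidal kernel under a two-sided exponential window. It is not the authors' AaPE implementation: the function names, the fixed `freq` and `decay` arguments (which AaSP estimates from the input), the truncation `half_width`, and the normalization are all assumptions made for illustration.

```python
import numpy as np


def aliased_frequency(freq, stride):
    """Apparent frequency (cycles per original frame) after keeping
    every `stride`-th frame. Hypothetical helper, for illustration only."""
    fs = 1.0 / stride  # sampling rate after downsampling
    # Fold the frequency into the baseband [0, fs / 2].
    return abs(freq - fs * np.round(freq / fs))


def two_sided_exp_sinusoid(freq, decay, half_width):
    """Complex sinusoid at `freq` (cycles per frame) shaped by a two-sided
    exponential window exp(-decay * |n|), truncated to |n| <= half_width.
    In this sketch freq and decay are fixed inputs, not learned."""
    n = np.arange(-half_width, half_width + 1)
    window = np.exp(-decay * np.abs(n))
    kernel = window * np.exp(2j * np.pi * freq * n)
    # Normalize so a sinusoid at exactly `freq` passes with unit gain.
    return kernel / window.sum()


# Demo: a modulation above the post-downsampling Nyquist limit.
stride = 16
freq = 0.09  # cycles per frame, above 1 / (2 * stride)
frames = np.arange(512)
signal = np.exp(2j * np.pi * freq * frames)

kernel = two_sided_exp_sinusoid(freq, decay=0.05, half_width=64)
band = np.convolve(signal, kernel, mode="valid")  # analysed before downsampling

print("Nyquist after downsampling:", 1.0 / (2 * stride))
print("Frequency after folding:", aliased_frequency(freq, stride))
print("Mean band magnitude:", np.abs(band).mean())
```

With a stride of 16 frames the post-downsampling Nyquist limit is 1/32 cycles per frame, so a modulation at 0.09 cycles per frame folds to 0.0275. A kernel centred at 0.09 and applied before downsampling still passes that component with unit gain, which is the intuition behind analysing alias-prone bands separately rather than low-pass filtering them away.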
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Classification | Urbansound8K | Accuracy | 89.7 | 126 |
| Musical Instrument Classification | NSynth | Accuracy | 80.5 | 117 |
| Audio Classification | Speech Commands V2 (test) | Accuracy | 97.9 | 46 |
| Audio Classification | CREMA-D | Accuracy | 68.7 | 15 |
| Audio Classification | AudioSet 20K v1 | mAP | 41.9 | 11 |
| Audio Classification | AudioSet 2M v1 | mAP | 49.8 | 10 |
| Audio Classification | ESC-50 v1 (test) | Accuracy | 0.975 | 9 |