AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

About

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naïve low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.
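
The aliasing effect and the kernel shape described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: the function names, the fixed `freq` and `decay` values (which AaPE estimates from the input), the kernel length, and the stride of 4 are all assumptions chosen for illustration. The sketch shows a temporal modulation at 0.4 cycles/frame folding down to 0.1 cycles/frame after strided downsampling, and a complex sinusoid under a two-sided exponential window picking up that band before the downsampling step.

```python
import numpy as np

def exp_windowed_kernel(freq, decay, half_width):
    """Complex sinusoid at `freq` (cycles/frame) under a two-sided
    exponential window exp(-decay * |n|). Hypothetical helper; the
    parameters are fixed here, not estimated from the input."""
    n = np.arange(-half_width, half_width + 1)
    window = np.exp(-decay * np.abs(n))
    kernel = window * np.exp(2j * np.pi * freq * n)
    return kernel / window.sum()  # unit gain at the centre frequency

def dominant_freq(x):
    """Frequency (cycles/sample) of the strongest bin of a real signal."""
    spectrum = np.abs(np.fft.rfft(x - x.mean()))
    return np.fft.rfftfreq(len(x))[np.argmax(spectrum)]

# A temporal modulation at 0.4 cycles/frame along the time axis.
frames = np.arange(400)
f0 = 0.4
x = np.cos(2 * np.pi * f0 * frames)

# Strided downsampling (assumed stride 4) lowers the Nyquist limit from
# 0.5 to 0.125 cycles/frame, so the 0.4 component folds to a lower one.
stride = 4
decimated = x[::stride]
aliased = dominant_freq(decimated) / stride  # in cycles per original frame

# Analysing the band before downsampling keeps its energy as a feature.
kernel = exp_windowed_kernel(freq=f0, decay=0.2, half_width=16)
band = np.convolve(x, kernel, mode="same")
band_energy = np.abs(band[::stride])

print(f"apparent frequency after stride {stride}: {aliased:.3f} cycles/frame")
print(f"median band magnitude: {np.median(band_energy):.3f}")
```

In this toy setting the band magnitude stays close to half the cosine amplitude, since the kernel passes one of the two complex exponentials that make up a real cosine; how AaPE fuses such subband outputs with standard patch tokens is not covered by this sketch.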

Kohei Yamamoto, Kosuke Okusa • 2025

Related benchmarks

Task	Dataset	Result
Audio Classification	Urbansound8K	Accuracy 89.7	126
Musical Instrument Classification	NSynth	Accuracy 80.5	123
Audio Classification	Speech Commands V2 (test)	Accuracy 97.9	59
Audio Classification	CREMA-D	Accuracy 68.7	26
Audio Classification	AudioSet 20K v1	mAP 41.9	11
Audio Classification	AudioSet 2M v1	mAP 49.8	10
Audio Classification	ESC-50 v1 (test)	Accuracy 0.975	9

