AaSP: Aliasing-aware Self-Supervised Pre-Training for Audio Spectrogram Transformers

About

Transformer-based audio self-supervised learning (SSL) models commonly use spectrograms, vision-style Transformers, and masked modeling objectives. However, convolutional patchification with temporal downsampling lowers the effective Nyquist frequency and introduces aliasing, while naive low-pass filtering may remove task-relevant high-frequency cues. We present AaSP, an aliasing-aware self-supervised pre-training framework for audio spectrogram transformers. AaSP combines an aliasing-aware patch representation, teacher-student masked modeling, a cross-attention predictor, and multi-mask contrastive regularization to learn representations that integrate features from alias-prone modulation bands while remaining stable across masked views. Its patch-embedding module, Aliasing-aware Patch Embedding (AaPE), augments standard patch tokens with features from alias-prone modulation bands using a band-limited complex sinusoidal kernel with a two-sided exponential window. The kernel's frequency and decay parameters are estimated from the input, enabling adaptive subband analysis whose outputs are fused with standard patch tokens. We pre-train on AudioSet and evaluate the learned representations by fine-tuning and linear evaluation on acoustic/environmental, speech, and music recognition benchmarks. Under fine-tuning, the full AaSP framework achieves state-of-the-art results on AS-20K, ESC-50, and NSynth among compared self-supervised baselines, while remaining competitive elsewhere. Linear evaluation shows a similar trend, including gains on US8K and NSynth. Overall, AaSP learns representations that are more stable under aliasing-sensitive temporal perturbations and competitive for downstream transfer.

Kohei Yamamoto, Kosuke Okusa • 2025
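
As a rough illustration of the kernel named in the abstract, the sketch below builds a complex sinusoid under a two-sided exponential window and uses it to measure energy in one modulation band of a 1-D sequence. This is not the authors' implementation: the function names, the normalization, the fixed `half_width`, and the hand-set `freq` and `decay` values are all assumptions made for illustration, whereas the abstract states that AaPE estimates the frequency and decay parameters from the input.

```python
import numpy as np

def exp_windowed_complex_kernel(freq, decay, half_width):
    """Hypothetical helper: complex sinusoid under a two-sided exponential window.

    freq  : centre frequency in cycles per sample (assumed hand-set here)
    decay : window decay rate; larger values give a shorter, wider-band kernel
    """
    n = np.arange(-half_width, half_width + 1)
    window = np.exp(-decay * np.abs(n))            # two-sided exponential window
    kernel = window * np.exp(2j * np.pi * freq * n)  # complex sinusoid carrier
    return kernel / window.sum()                    # unit gain at the centre frequency

def subband_magnitude(x, freq, decay, half_width=32):
    """Hypothetical helper: magnitude of x in the band centred on freq."""
    kernel = exp_windowed_complex_kernel(freq, decay, half_width)
    return np.abs(np.convolve(x, kernel, mode="same"))

# Usage: a tone at 0.3 cycles/sample analysed by a matched and a mismatched band.
t = np.arange(512)
x = np.cos(2 * np.pi * 0.3 * t)
matched = subband_magnitude(x, freq=0.3, decay=0.1)
mismatched = subband_magnitude(x, freq=0.1, decay=0.1)
print(matched[256], mismatched[256])
```

Away from the sequence edges, the matched band should respond far more strongly than the mismatched one. Under this reading, `decay` trades temporal locality against bandwidth, which is one plausible reason to adapt it to the input.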

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio Classification | Urbansound8K | Accuracy | 89.7 | 126 |
| Musical Instrument Classification | NSynth | Accuracy | 80.5 | 117 |
| Audio Classification | Speech Commands V2 (test) | Accuracy | 97.9 | 46 |
| Audio Classification | CREMA-D | Accuracy | 68.7 | 15 |
| Audio Classification | AudioSet 20K v1 | mAP | 41.9 | 11 |
| Audio Classification | AudioSet 2M v1 | mAP | 49.8 | 10 |
| Audio Classification | ESC-50 v1 (test) | Accuracy | 0.975 | 9 |
