
ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

About

In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
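The page does not include the ASDA implementation, but the core idea described above, subtracting a second softmax attention map scaled by a differential coefficient to cancel attention assigned to irrelevant tokens, can be sketched in a few lines. The function and parameter names below (e.g. `differential_attention`, `lam`) are illustrative, not taken from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Minimal sketch of dual-softmax differential attention.

    Two query/key projections produce two attention maps; the second,
    scaled by the differential coefficient `lam`, is subtracted from the
    first so that attention weights spent on irrelevant positions cancel.
    With lam = 0 this reduces to standard scaled dot-product attention.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # primary attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # noise-estimating attention map
    return (a1 - lam * a2) @ v
```

In the paper's full architecture the coefficient is tuned rather than fixed, and the operation runs per head inside a Transformer block; this sketch only shows the single-head dual-softmax subtraction.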

Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang · 2025

Related benchmarks

Task                  Dataset                     Metric    Result  Rank
Audio Classification  ESC-50                      Accuracy  96.1    325
Audio Classification  AudioSet 20K                mAP       41.5    128
Audio Recognition     Speech Commands V2          Accuracy  98.3    43
Audio Classification  Speech Commands V2 (test)   Accuracy  98.3    35
Classification        AudioSet AS-2M              mAP (%)   49      21
Audio Classification  AudioSet 20K v1             mAP       41.5    11
Audio Classification  AudioSet 2M v1              mAP       49      10
Audio Classification  ESC-50 v1 (test)            Accuracy  0.961   9
