
ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning

About

In recent advancements in audio self-supervised representation learning, the standard Transformer architecture has emerged as the predominant approach, yet its attention mechanism often allocates a portion of attention weights to irrelevant information, potentially impairing the model's discriminative ability. To address this, we introduce a differential attention mechanism, which effectively mitigates ineffective attention allocation through the integration of dual-softmax operations and appropriately tuned differential coefficients. Experimental results demonstrate that our ASDA model achieves state-of-the-art (SOTA) performance across multiple benchmarks, including audio classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting (98.3% accuracy on SPC-2), and environmental sound classification (96.1% accuracy on ESC-50). These results highlight ASDA's effectiveness in audio tasks, paving the way for broader applications.
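The page does not include the ASDA implementation, but the core idea described above, subtracting a second softmax attention map scaled by a differential coefficient to cancel attention assigned to irrelevant tokens, can be sketched in a few lines. The function and parameter names below (e.g. `differential_attention`, `lam`) are illustrative, not taken from the paper's code:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Minimal sketch of dual-softmax differential attention.

    Two query/key projections produce two attention maps; the second,
    scaled by the differential coefficient `lam`, is subtracted from the
    first so that attention weights spent on irrelevant positions cancel.
    With lam = 0 this reduces to standard scaled dot-product attention.
    """
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))  # primary attention map
    a2 = softmax(q2 @ k2.T / np.sqrt(d))  # noise-estimating attention map
    return (a1 - lam * a2) @ v
```

In the paper's full architecture the coefficient is tuned rather than fixed, and the operation runs per head inside a Transformer block; this sketch only shows the single-head dual-softmax subtraction.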

Junyu Wang, Tianrui Wang, Meng Ge, Longbiao Wang, Jianwu Dang · 2025

Related benchmarks

Task                  Dataset                     Metric    Result  Rank
Audio Classification  ESC-50                      Accuracy  96.1    325
Audio Classification  AudioSet 20K                mAP       41.5    128
Audio Recognition     Speech Commands V2          Accuracy  98.3    43
Audio Classification  Speech Commands V2 (test)   Accuracy  98.3    35
Classification        AudioSet AS-2M              mAP (%)   49      21
Audio Classification  AudioSet 20K v1             mAP       41.5    11
Audio Classification  AudioSet 2M v1              mAP       49      10
Audio Classification  ESC-50 v1 (test)            Accuracy  0.961   9
