AudioMosaic: Contrastive Masked Audio Representation Learning

About

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available on GitHub: https://github.com/HanxunH/AudioMosaic
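The positive-pair construction described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`structured_mask`, `two_views`, `info_nce`), the mask ratios, and the choice to drop whole time columns and frequency rows are all assumptions made for illustration, and InfoNCE is swapped in as a standard contrastive objective because the abstract does not name the loss. The sketch only shows why masking saves memory (each view keeps a fraction of the patches) and how two masked views of one clip act as a positive pair.

```python
import math
import random


def structured_mask(n_time, n_freq, time_ratio, freq_ratio, rng):
    """Return the (t, f) patch coordinates that stay visible.

    Assumption: "structured" masking is modelled here by dropping whole
    time columns and whole frequency rows of the patch grid.
    """
    drop_t = set(rng.sample(range(n_time), int(n_time * time_ratio)))
    drop_f = set(rng.sample(range(n_freq), int(n_freq * freq_ratio)))
    return [
        (t, f)
        for t in range(n_time)
        for f in range(n_freq)
        if t not in drop_t and f not in drop_f
    ]


def two_views(n_time, n_freq, time_ratio=0.5, freq_ratio=0.25, seed=0):
    """Draw two independent masks of the same clip (hypothetical ratios)."""
    rng = random.Random(seed)
    view_a = structured_mask(n_time, n_freq, time_ratio, freq_ratio, rng)
    view_b = structured_mask(n_time, n_freq, time_ratio, freq_ratio, rng)
    return view_a, view_b


def info_nce(za, zb, temperature=0.1):
    """InfoNCE over a batch: za[i] and zb[i] are positives, zb[j != i] negatives."""

    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    za = [normalize(v) for v in za]
    zb = [normalize(v) for v in zb]
    loss = 0.0
    for i, a in enumerate(za):
        # Cosine similarity of anchor i against every candidate in the batch.
        logits = [sum(x * y for x, y in zip(a, b)) / temperature for b in zb]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(za)


if __name__ == "__main__":
    view_a, view_b = two_views(n_time=8, n_freq=4)
    print(len(view_a), len(view_b), "visible patches out of", 8 * 4)
    aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

With these hypothetical ratios, each view keeps 12 of the 32 patches in an 8 x 4 grid, so an encoder that processes only visible patches handles far fewer tokens per clip. That is one plausible reading of how masking would free memory for larger contrastive batches.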

Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani • 2026

Related benchmarks

Task                   Dataset             Metric     Result  Rank
Audio Classification   ESC-50              Accuracy   97.5    441
Audio Classification   AudioSet 20K        mAP        42.5    147
Audio Classification   AudioSet 2M         mAP        50.2    98
Speech Classification  Speech Commands V2  Accuracy   98.4    15
Speech Classification  Speech Commands V1  Accuracy   99      13
Deepfake Detection     EnvSDD 01 (test)    EER (TTA)  0.00    4
Deepfake Detection     EnvSDD 02 (test)    EER (TTA)  0.05    4
Deepfake Detection     EnvSDD 03 (test)    EER (TTA)  0.38    4
Deepfake Detection     EnvSDD 04 (test)    EER (TTA)  4.8     4
Deepfake Detection     EnvSDD Average      EER (TTA)  1.3     4
