AudioMosaic: Contrastive Masked Audio Representation Learning

About

Audio self-supervised learning (SSL) aims to learn general-purpose representations from large-scale unlabeled audio data. While recent advances have been driven mainly by generative reconstruction objectives, contrastive approaches remain less explored, partly due to the difficulty of designing effective audio augmentations and the large batch sizes required for contrastive pre-training. We introduce AudioMosaic, a contrastive learning-based audio encoder for general audio understanding. During pre-training, AudioMosaic constructs positive pairs by applying structured time-frequency masking to spectrogram patches, which reduces memory usage and enables efficient large-batch training. Compared with generative approaches, the AudioMosaic encoder learns more discriminative utterance-level representations that demonstrate strong transferability across datasets, domains, and acoustic conditions. Extensive experiments show that AudioMosaic achieves state-of-the-art performance on several standard audio benchmarks under both linear probing and fine-tuning. We further show that integrating the pretrained AudioMosaic encoder into audio-language models improves performance on audio-language tasks. The code is publicly available at https://github.com/HanxunH/AudioMosaic.

Hanxun Huang, Qizhou Wang, Xingjun Ma, Cihang Xie, Christopher Leckie, Sarah Erfani • 2026
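
The abstract's core idea, building a positive pair from two masked views of the same spectrogram and encoding only the visible patches, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration and is not taken from the AudioMosaic code: the masking scheme (`structured_mask` keeps the intersection of randomly chosen time columns and frequency rows), the stand-in encoder (`toy_encode`, mean pooling plus a linear projection), and the one-directional InfoNCE loss (`info_nce`). The memory saving comes from `masked_view`, which passes only the visible patches to the encoder.

```python
import numpy as np


def structured_mask(n_time, n_freq, keep_ratio, rng):
    """Boolean (n_time, n_freq) grid; True marks a visible patch.

    Hypothetical scheme: keep the intersection of randomly chosen time
    columns and frequency rows, so roughly keep_ratio of patches survive.
    """
    per_axis = np.sqrt(keep_ratio)
    keep_t = max(1, int(round(n_time * per_axis)))
    keep_f = max(1, int(round(n_freq * per_axis)))
    t_idx = rng.choice(n_time, size=keep_t, replace=False)
    f_idx = rng.choice(n_freq, size=keep_f, replace=False)
    mask = np.zeros((n_time, n_freq), dtype=bool)
    mask[np.ix_(t_idx, f_idx)] = True
    return mask


def masked_view(patches, mask):
    """Drop masked patches: (n_time, n_freq, dim) -> (n_visible, dim)."""
    return patches[mask]


def toy_encode(visible, weight):
    """Stand-in encoder (assumption): mean-pool, project, L2-normalize."""
    z = visible.mean(axis=0) @ weight
    return z / (np.linalg.norm(z) + 1e-12)


def info_nce(z_a, z_b, temperature=0.1):
    """One-directional InfoNCE: row i of z_a should match row i of z_b."""
    logits = (z_a @ z_b.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
batch, n_time, n_freq, dim, out_dim = 4, 8, 8, 16, 8
patches = rng.normal(size=(batch, n_time, n_freq, dim))  # fake patch embeddings
weight = rng.normal(size=(dim, out_dim))                 # fake encoder weights

views_a, views_b = [], []
for clip in patches:
    # Two independent masks of the same clip form one positive pair.
    mask_a = structured_mask(n_time, n_freq, 0.25, rng)
    mask_b = structured_mask(n_time, n_freq, 0.25, rng)
    views_a.append(toy_encode(masked_view(clip, mask_a), weight))
    views_b.append(toy_encode(masked_view(clip, mask_b), weight))

loss = info_nce(np.stack(views_a), np.stack(views_b))
print(f"contrastive loss on random data: {loss:.3f}")
```

With a keep ratio of 0.25, each view contains 16 of the 64 patches, so the encoder processes a quarter of the tokens per view. In this sketch that reduction is what would leave room for a larger batch, and therefore more negatives per step.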

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Audio Classification	ESC-50	Accuracy	97.5	461
Audio Classification	AudioSet 20K	mAP	42.5	151
Audio Classification	AudioSet 2M	mAP	50.2	102
Speech Classification	Speech Commands V1	Accuracy	99	19
Speech Classification	Speech Commands V2	Accuracy	98.4	15
Deepfake Detection	EnvSDD 01 (test)	EER (TTA)	0.00	4
Deepfake Detection	EnvSDD 02 (test)	EER (TTA)	0.05	4
Deepfake Detection	EnvSDD 03 (test)	EER (TTA)	0.38	4
Deepfake Detection	EnvSDD 04 (test)	EER (TTA)	4.8	4
Deepfake Detection	EnvSDD Average	EER (TTA)	1.3	4
