
A-JEPA: Joint-Embedding Predictive Architecture Can Listen

About

This paper shows that the masked-modeling principle driving the success of large foundational vision models can be effectively applied to audio by making predictions in a latent space. We introduce the Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension for self-supervised learning from the audio spectrum. Following the design of I-JEPA, A-JEPA uses a context encoder to encode the audio spectrogram patches left visible by a curriculum masking strategy, and predicts the representations of regions sampled at well-designed locations. The target representations of those regions are extracted from the whole spectrogram by the target encoder, i.e., an exponential moving average of the context encoder. We find it beneficial to move from random block masking to time-frequency aware masking in a curriculum manner, because audio spectrograms are highly correlated locally in time and frequency. To enhance contextual semantic understanding and robustness, we fine-tune the encoder on target datasets with a regularized masking, instead of dropping inputs or setting them to zero. Empirically, when built with a Vision Transformer structure, A-JEPA is highly scalable and sets new state-of-the-art performance on multiple audio and speech classification tasks, outperforming other recent models that use externally supervised pre-training.
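The relationship between the two encoders described above can be illustrated with a minimal sketch of an exponential-moving-average update. This is not the authors' code: the function name `ema_update`, the dictionary-of-lists parameter layout, and the momentum values are all assumptions chosen for illustration, since the abstract does not specify them.

```python
def ema_update(target_params, context_params, momentum=0.996):
    """Return target parameters moved toward the context parameters by EMA.

    The default momentum is a hypothetical value, not one taken from the paper.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    return {
        name: [
            momentum * t + (1.0 - momentum) * c
            for t, c in zip(target_params[name], context_params[name])
        ]
        for name in target_params
    }


# Toy parameters standing in for encoder weights (hypothetical values).
context = {"w": [1.0, 2.0], "b": [0.5]}
target = {"w": [0.0, 0.0], "b": [0.0]}

# One update step: the target drifts slowly toward the context encoder.
target = ema_update(target, context, momentum=0.9)
```

In a setup like this, only the context encoder would receive gradients; the target encoder simply tracks it through repeated calls to the update, which is the usual reason for pairing an EMA copy with a trained network.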

Zhengcong Fei, Mingyuan Fan, Junshi Huang • 2023

Related benchmarks

Task | Dataset | Result | Rank
Audio Classification | ESC-50 | Accuracy 96.3 | 325
Audio Classification | AudioSet 20K | mAP 38.4 | 128
Audio Recognition | Speech Commands V2 | Accuracy 98.5 | 43
Audio Event Tagging | AudioSet AS-2M (full) | mAP 48.6 | 33
Keyword Spotting | Speech Commands KS1 v1 | Accuracy 97.7 | 24
Audio Event Tagging | AudioSet (AS-20K) | mAP 38.4 | 24
Keyword Spotting | Speech Commands KS2 v2 | Accuracy 98.5 | 23
Classification | AudioSet AS-2M | mAP (%) 48.6 | 21
