Scaling up masked audio encoder learning for general audio classification

About

Despite progress in audio classification, a generalization gap remains between speech and other sound domains, such as environmental sounds and music. Models trained for speech tasks often fail to perform well on environmental or musical audio tasks, and vice versa. While self-supervised (SSL) audio representations offer an alternative, there has been limited exploration of scaling both model and dataset sizes for SSL-based general audio classification. We introduce Dasheng, a simple SSL audio encoder, based on the efficient masked autoencoder framework. Trained with 1.2 billion parameters on 272,356 hours of diverse audio, Dasheng obtains significant performance gains on the HEAR benchmark. It outperforms previous works on CREMA-D, LibriCount, Speech Commands, VoxLingua, and competes well in music and environment classification. Dasheng features inherently contain rich speech, music, and environmental information, as shown in nearest-neighbor classification experiments. Code is available https://github.com/richermans/dasheng/.

Heinrich Dinkel, Zhiyong Yan, Yongqing Wang, Junbo Zhang, Yujun Wang, Bin Wang• 2024

Related benchmarks

Task	Dataset	Result
Music Genre Classification	GTZAN	Accuracy88.6	62
Audio Representation Evaluation	HEAR (Holistic Evaluation of Audio Representations)	HEAR Average81.44	47
Speech Emotion Recognition	RAVDESS	--	43
Fault Diagnosis	RMIS Fault Diagnosis Suite (IICA, IIEE, WTPG, MaFaulDa, SDUST, UMGED, PU)	Overall Mean Score53.66	28
Speaker Counting	Libricount	Score72.8	26
Language Identification	VoxLingua33	Accuracy86	26
Speaker Identification	LibriSpeech MF	Score97.5	26
Speech Processing	SUPERB	KWS Acc0.9773	24
Vocal Sound Classification	VocalSound	Accuracy92.5	21
Bioacoustic Analysis	Beans	wtkn77.3	20

Showing 10 of 34 rows

Other info

Follow for update

@wizwand_team Discord