CNN Architectures for Large-Scale Audio Classification

About

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
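To put the dataset scale in perspective, the two figures quoted above (70M training videos, 5.24 million hours) imply an average soundtrack length of roughly four and a half minutes per video. The short sketch below works through that arithmetic. The constant and function names are illustrative only, and "70M" is treated as exactly 70,000,000, which is an assumption since the abstract gives a rounded count.

```python
# Back-of-the-envelope estimate of average soundtrack length,
# using only the two figures quoted in the abstract.
# Assumption: "70M" is treated as exactly 70,000,000 videos.
NUM_VIDEOS = 70_000_000
TOTAL_HOURS = 5_240_000  # 5.24 million hours


def average_minutes_per_video(total_hours: float, num_videos: int) -> float:
    """Return the mean duration per video in minutes."""
    return total_hours * 60 / num_videos


avg_minutes = average_minutes_per_video(TOTAL_HOURS, NUM_VIDEOS)
print(f"Average soundtrack length: {avg_minutes:.2f} minutes")
```

Because the video count is rounded, the result is best read as an order-of-magnitude figure rather than an exact statistic.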

Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, R. Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron J. Weiss, Kevin Wilson • 2016

Related benchmarks

Task	Dataset	Result
Multimodal Emotion Recognition	MER 2023	F1 Score: 54.81	30
Taste Prediction	Unified music-taste corpus (test)	Sweet Score: 0.87	18
Speech to gesture translation	Speech2Gesture 1.0 (test)	Fooled Rate (%): 26.4	12
Fake News Video Detection	FakeSV (five-fold cross-val)	Accuracy: 66.91	12
Audio Quality Assessment	DCASE Task 7 System-level n=9 2023	FAD: 0.367	8
Classification	Beans	Accuracy (bats): 75	7
Detection	Beans	dcase: 0.372	7
Voice Activity Detection	AVA-Speech (test)	--	7
Speech to gesture translation	Speech2Gesture Oliver 1.0 (test)	Percentage Fooled: 36.9	6
Audio Quality Assessment	DCASE Task 7 Per-category granularity 2023	FAD: 0.113	6

Showing 10 of 15 rows
