LEAF: A Learnable Frontend for Audio Classification

About

Mel-filterbanks are fixed, engineered audio features which emulate human perception and have been used throughout the history of audio understanding up to today. However, their undeniable qualities are counterbalanced by the fundamental limitations of handmade representations. In this work we show that we can train a single learnable frontend that outperforms mel-filterbanks on a wide range of audio signals, including speech, music, audio events and animal sounds, providing a general-purpose learned frontend for audio classification. To do so, we introduce a new principled, lightweight, fully learnable architecture that can be used as a drop-in replacement for mel-filterbanks. Our system learns all operations of audio feature extraction, from filtering to pooling, compression and normalization, and can be integrated into any neural network at a negligible parameter cost. We perform multi-task training on eight diverse audio classification tasks, and show consistent improvements of our model over mel-filterbanks and previous learnable alternatives. Moreover, our system outperforms the current state-of-the-art learnable frontend on AudioSet, with orders of magnitude fewer parameters.
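
The abstract names four stages (filtering, pooling, compression, normalization) without giving their form. As a rough illustration only, the NumPy sketch below chains them, assuming Gabor band-pass filters, Gaussian lowpass pooling, and PCEN-style compression with normalization. These specific choices, all function names, and all parameter values are assumptions made for illustration; they are not taken from the text above or from the authors' code. In a trained frontend the quantities marked "learnable" would be updated by gradient descent; here they are fixed arrays.

```python
import numpy as np

def gabor_filters(center_freqs, bandwidths, size):
    """Complex Gabor filters: Gaussian envelope times a complex sinusoid.

    center_freqs: (n_filters,) in radians/sample   (learnable in principle)
    bandwidths:   (n_filters,) Gaussian std in samples (learnable in principle)
    """
    t = np.arange(-(size // 2), (size + 1) // 2)  # `size` taps centred on 0
    envelope = np.exp(-t[None, :] ** 2 / (2.0 * bandwidths[:, None] ** 2))
    norm = 1.0 / (np.sqrt(2.0 * np.pi) * bandwidths[:, None])
    carrier = np.exp(1j * center_freqs[:, None] * t[None, :])
    return norm * envelope * carrier

def filter_and_square(wave, filters):
    """Filtering: convolve the waveform with each filter, keep squared modulus."""
    return np.stack(
        [np.abs(np.convolve(wave, f, mode="same")) ** 2 for f in filters]
    )

def gaussian_pool(energy, sigma, size, stride):
    """Pooling: Gaussian lowpass (std `sigma` in samples) then subsampling.

    A single shared sigma is used here for brevity (assumption).
    """
    t = np.arange(size) - (size - 1) / 2.0
    window = np.exp(-0.5 * (t / sigma) ** 2)
    window /= window.sum()
    smoothed = np.stack([np.convolve(e, window, mode="same") for e in energy])
    return smoothed[:, ::stride]

def pcen(x, s=0.04, alpha=0.96, delta=2.0, r=0.5, eps=1e-12):
    """Compression and normalization in the style of PCEN.

    m is an exponential moving average of x over time; x is divided by
    m**alpha (normalization) and then compressed with exponent r.
    """
    m = np.empty_like(x)
    m[:, 0] = x[:, 0]
    for t in range(1, x.shape[1]):
        m[:, t] = (1.0 - s) * m[:, t - 1] + s * x[:, t]
    return (x / (eps + m) ** alpha + delta) ** r - delta ** r

def leaf_like_frontend(wave, center_freqs, bandwidths,
                       pool_sigma=64.0, filter_size=401, stride=160):
    """Waveform (n_samples,) -> features (n_filters, n_frames)."""
    filters = gabor_filters(center_freqs, bandwidths, filter_size)
    energy = filter_and_square(wave, filters)
    pooled = gaussian_pool(energy, pool_sigma, filter_size, stride)
    return pcen(pooled)
```

The output has the same layout as a mel-spectrogram (one row per filter, one column per frame), which is what would let such a frontend stand in for mel-filterbanks in front of an existing classifier. The parameter count is small because each filter is described by only a center frequency and a bandwidth rather than by all of its taps.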

Neil Zeghidour, Olivier Teboul, Félix de Chaumont Quitry, Marco Tagliasacchi • 2021

Related benchmarks

Task	Dataset	Result
Musical Instrument Classification	NSynth	Accuracy 69.2	123
Spoof Speech Detection	ASVspoof LA 2021 (eval)	min-tDCF 0.2753	37
Anti-spoofing	ASVspoof LA 2019 (test)	EER 2.49	32
Audio Classification	CREMA-D	Accuracy 50.2	26
Audio Classification	NSynth Pitch	Accuracy 92.2	8
Audio Classification	VoxForge	Accuracy 91.5	5
Audio Classification	BirdCLEF 2021	Accuracy 42.3	5
Audio Classification	SpeechCommands v1 v2 (test)	Accuracy 95.1	5

