Learning Filterbanks from Raw Speech for Phone Recognition
About
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
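The front-end described above can be sketched as follows: a bank of complex Gabor-like wavelets with mel-spaced center frequencies is convolved with the raw waveform, the squared modulus is taken, and a decimated moving average stands in for the low-pass smoothing step. This is an illustrative numpy sketch, not the paper's implementation: the filter size, the 4-cycles-per-envelope bandwidth rule, and the window/stride values are assumptions, and in the actual model these filters are trainable network weights rather than fixed arrays.

```python
import numpy as np

def mel_center_freqs(n_filters, sample_rate):
    """Center frequencies evenly spaced on the mel scale (range is illustrative)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(60.0), hz_to_mel(0.95 * sample_rate / 2), n_filters)
    return mel_to_hz(mels)

def gabor_filterbank(n_filters=40, size=401, sample_rate=16000):
    """Bank of complex Gabor wavelets approximating mel filters."""
    freqs = mel_center_freqs(n_filters, sample_rate)
    t = (np.arange(size) - size // 2) / sample_rate
    filters = np.empty((n_filters, size), dtype=np.complex128)
    for i, f in enumerate(freqs):
        sigma = 4.0 / (2.0 * np.pi * f)        # ~4 carrier cycles per envelope (assumption)
        envelope = np.exp(-0.5 * (t / sigma) ** 2)
        carrier = np.exp(2j * np.pi * f * t)   # complex (analytic) carrier
        filt = envelope * carrier
        filters[i] = filt / np.linalg.norm(filt)
    return filters, freqs

def td_filterbank(wave, filters, window=400, stride=160):
    """Squared modulus of complex convolution, low-pass averaged, log-compressed."""
    feats = []
    for filt in filters:
        conv = np.convolve(wave, filt, mode="same")
        energy = np.abs(conv) ** 2
        # decimated moving average stands in for the learned low-pass layer
        frames = [energy[s:s + window].mean()
                  for s in range(0, len(energy) - window + 1, stride)]
        feats.append(np.log(np.asarray(frames) + 1e-8))
    return np.stack(feats)  # shape: (n_filters, n_frames)
```

A pure 1 kHz tone fed through this bank produces its strongest output in the filter whose center frequency lies closest to 1 kHz, which is the sanity check one would expect of a mel-like decomposition; in the learned version, these fixed Gabor arrays become the initialization that is then fine-tuned jointly with the recognizer.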
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Recognition | WSJ nov93 (dev) | WER | 6.8 | 52 |
| Spoof Speech Detection | ASVspoof LA 2021 (eval) | min-tDCF | 0.2522 | 36 |
| Speech Recognition | WSJ nov92 (test) | WER | 3.5 | 34 |
| Anti-spoofing | ASVspoof LA 2019 (test) | EER | 1.83 | 32 |
| Phoneme Recognition | TIMIT (test) | PER | 18 | 31 |
| Speech Recognition | Wall Street Journal open vocabulary (dev93) | WER | 6.8 | 28 |
| Phoneme Recognition | TIMIT (dev) | PER | 15.6 | 20 |