Learning Filterbanks from Raw Speech for Phone Recognition

About

We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
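
As a rough illustration of the front end described above, the following Python sketch builds a bank of complex filters, applies it to a raw waveform, and follows it with squared modulus, low-pass averaging, and log compression. It uses complex Gabor filters centred on mel-spaced frequencies as one plausible way to approximate mel-filterbanks in the time domain. The filter count, filter width, stride, bandwidth rule, unit-norm scaling, squared Hann averaging window, and log(1 + x) compression are all assumptions chosen for illustration, not values taken from this page. The function names are hypothetical, pre-emphasis is omitted, and the filters are fixed here rather than fine-tuned with a network.

```python
import numpy as np


def hz_to_mel(f):
    # Common HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * np.log10(1.0 + f / 700.0)


def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def gabor_filterbank(n_filters=40, width=401, sample_rate=16000):
    """Return (filters, centers_hz): complex Gabor filters on a mel-spaced grid."""
    # n_filters + 2 edges evenly spaced in mel; filter i is centred on edge i + 1.
    edges = mel_to_hz(np.linspace(hz_to_mel(0.0),
                                  hz_to_mel(sample_rate / 2.0),
                                  n_filters + 2))
    centers = edges[1:-1]
    # Assumed bandwidth rule: match the full width at half maximum (FWHM) of the
    # Gaussian power response to that of a triangular mel filter (half its base).
    fwhm = (edges[2:] - edges[:-2]) / 2.0
    sigma_f = fwhm / (2.0 * np.sqrt(np.log(2.0)))          # in Hz
    sigma_t = sample_rate / (2.0 * np.pi * sigma_f)        # in samples
    t = np.arange(width) - width // 2
    envelope = np.exp(-0.5 * (t[None, :] / sigma_t[:, None]) ** 2)
    carrier = np.exp(2j * np.pi * centers[:, None] * t[None, :] / sample_rate)
    filters = envelope * carrier
    # Unit L2 norm per filter (an assumption, not a prescribed scaling).
    filters /= np.linalg.norm(filters, axis=1, keepdims=True)
    return filters, centers


def td_filterbank_features(wave, filters, pool_width=400, stride=160):
    """Complex convolution -> squared modulus -> low-pass averaging -> log."""
    energy = np.stack([np.abs(np.convolve(wave, f, mode="valid")) ** 2
                       for f in filters])
    window = np.hanning(pool_width) ** 2
    window /= window.sum()
    pooled = np.stack([np.convolve(e, window, mode="valid")[::stride]
                       for e in energy])
    return np.log1p(pooled)


# Demo on a synthetic 1 kHz tone, one second at 16 kHz.
sample_rate = 16000
time = np.arange(sample_rate) / sample_rate
tone = np.sin(2.0 * np.pi * 1000.0 * time)
filters, centers = gabor_filterbank(sample_rate=sample_rate)
features = td_filterbank_features(tone, filters)
strongest = int(np.argmax(features.mean(axis=1)))
```

In this sketch `features` has one row per filter and one column per frame, and `strongest` indexes the filter that responds most to the test tone. In a learned front end, the arrays in `filters` would serve only as initial values for trainable convolution weights.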

Neil Zeghidour, Nicolas Usunier, Iasonas Kokkinos, Thomas Schatz, Gabriel Synnaeve, Emmanuel Dupoux • 2017

Related benchmarks

Task                    Dataset                                       Result           Rank
Speech Recognition      WSJ nov93 (dev)                               WER 6.8          52
Spoof Speech Detection  ASVspoof LA 2021 (eval)                       min-tDCF 0.2522  36
Speech Recognition      WSJ nov92 (test)                              WER 3.5          34
Anti-spoofing           ASVspoof LA 2019 (test)                       EER 1.83         32
Phoneme Recognition     TIMIT (test)                                  PER 18           31
Speech Recognition      Wall Street Journal open vocabulary (dev93)   WER 6.8          28
Phoneme Recognition     TIMIT (dev)                                   PER 15.6         20