Learning Filterbanks from Raw Speech for Phone Recognition
About
We train a bank of complex filters that operates on the raw waveform and is fed into a convolutional neural network for end-to-end phone recognition. These time-domain filterbanks (TD-filterbanks) are initialized as an approximation of mel-filterbanks, and then fine-tuned jointly with the remaining convolutional architecture. We perform phone recognition experiments on TIMIT and show that for several architectures, models trained on TD-filterbanks consistently outperform their counterparts trained on comparable mel-filterbanks. We get our best performance by learning all front-end steps, from pre-emphasis up to averaging. Finally, we observe that the filters at convergence have an asymmetric impulse response, and that some of them remain almost analytic.
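The front-end described above can be sketched as follows: a bank of complex Gabor-like wavelets with mel-spaced center frequencies is convolved with the raw waveform, the squared modulus is taken, and a decimated moving average stands in for the low-pass smoothing step. This is an illustrative numpy sketch, not the paper's implementation: the filter size, the 4-cycles-per-envelope bandwidth rule, and the window/stride values are assumptions, and in the actual model these filters are trainable network weights rather than fixed arrays.

```python
import numpy as np

def mel_center_freqs(n_filters, sample_rate):
    """Center frequencies evenly spaced on the mel scale (range is illustrative)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mels = np.linspace(hz_to_mel(60.0), hz_to_mel(0.95 * sample_rate / 2), n_filters)
    return mel_to_hz(mels)

def gabor_filterbank(n_filters=40, size=401, sample_rate=16000):
    """Bank of complex Gabor wavelets approximating mel filters."""
    freqs = mel_center_freqs(n_filters, sample_rate)
    t = (np.arange(size) - size // 2) / sample_rate
    filters = np.empty((n_filters, size), dtype=np.complex128)
    for i, f in enumerate(freqs):
        sigma = 4.0 / (2.0 * np.pi * f)        # ~4 carrier cycles per envelope (assumption)
        envelope = np.exp(-0.5 * (t / sigma) ** 2)
        carrier = np.exp(2j * np.pi * f * t)   # complex (analytic) carrier
        filt = envelope * carrier
        filters[i] = filt / np.linalg.norm(filt)
    return filters, freqs

def td_filterbank(wave, filters, window=400, stride=160):
    """Squared modulus of complex convolution, low-pass averaged, log-compressed."""
    feats = []
    for filt in filters:
        conv = np.convolve(wave, filt, mode="same")
        energy = np.abs(conv) ** 2
        # decimated moving average stands in for the learned low-pass layer
        frames = [energy[s:s + window].mean()
                  for s in range(0, len(energy) - window + 1, stride)]
        feats.append(np.log(np.asarray(frames) + 1e-8))
    return np.stack(feats)  # shape: (n_filters, n_frames)
```

A pure 1 kHz tone fed through this bank produces its strongest output in the filter whose center frequency lies closest to 1 kHz, which is the sanity check one would expect of a mel-like decomposition; in the learned version, these fixed Gabor arrays become the initialization that is then fine-tuned jointly with the recognizer.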
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Speech Recognition | WSJ nov93 (dev) | WER | 6.8 | 52 |
| Spoof Speech Detection | ASVspoof LA 2021 (eval) | min-tDCF | 0.2522 | 36 |
| Speech Recognition | WSJ nov92 (test) | WER | 3.5 | 34 |
| Anti-spoofing | ASVspoof LA 2019 (test) | EER | 1.83 | 32 |
| Phoneme Recognition | TIMIT (test) | PER | 18 | 31 |
| Speech Recognition | Wall Street Journal open vocabulary (dev93) | WER | 6.8 | 28 |
| Phoneme Recognition | TIMIT (dev) | PER | 15.6 | 20 |