The IBM 2016 English Conversational Telephone Speech Recognition System
About
We describe a collection of acoustic and language modeling techniques that lowered the word error rate of our English conversational telephone LVCSR system to a record 6.6% on the Switchboard subset of the Hub5 2000 evaluation test set. On the acoustic side, we use a score fusion of three strong models: recurrent nets with maxout activations, very deep convolutional nets with 3x3 kernels, and bidirectional long short-term memory nets which operate on FMLLR and i-vector features. On the language modeling side, we use an updated model "M" and hierarchical neural network LMs.
George Saon, Tom Sercu, Steven Rennie, Hong-Kwang J. Kuo • 2016
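The abstract mentions a score fusion of three acoustic models but does not say how the scores are combined. The sketch below assumes one common form, a weighted sum of per-frame, per-state log-likelihood scores. The function name `fuse_scores`, the weights, and the toy scores are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def fuse_scores(
    model_scores: Sequence[Sequence[Sequence[float]]],
    weights: Sequence[float],
) -> List[List[float]]:
    """Combine per-frame, per-state log scores from several acoustic models.

    model_scores[m][t][s] is the log score that model m assigns to state s
    at frame t. ASSUMPTION: fusion is a weighted sum in the log domain;
    the source does not specify the exact rule.
    """
    if len(model_scores) != len(weights):
        raise ValueError("need exactly one weight per model")
    n_frames = len(model_scores[0])
    n_states = len(model_scores[0][0]) if n_frames else 0
    for scores in model_scores:
        if len(scores) != n_frames or any(len(row) != n_states for row in scores):
            raise ValueError("all models must score the same frames and states")
    return [
        [
            sum(w * scores[t][s] for w, scores in zip(weights, model_scores))
            for s in range(n_states)
        ]
        for t in range(n_frames)
    ]


# Toy example: three models, one frame, two states (made-up numbers).
rnn = [[-1.0, -2.0]]
cnn = [[-2.0, -1.0]]
blstm = [[-1.0, -3.0]]
fused = fuse_scores([rnn, cnn, blstm], weights=[0.5, 0.25, 0.25])
```

In a real system the fused scores would be passed to the decoder in place of a single model's scores, and the weights would typically be tuned on held-out data.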
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | SWITCHBOARD swbd | WER 6.6% | 39 |
| Speech Recognition | Hub5'00 CH (test) | WER 12.2% | 28 |
| Automatic Speech Recognition | NIST CTS CallHome 2000 | WER 12.2% | 27 |
| Automatic Speech Recognition | Hub5 2000 (SWB) | WER 6.6% | 21 |
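The results above are reported as word error rate (WER). As a minimal sketch, WER can be computed as the word-level edit distance (substitutions, deletions, insertions) between a reference and a hypothesis transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative only. Official scoring tools also apply text normalization, which is omitted here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over four reference words.
wer = word_error_rate("a b c d", "a x c")
```

Multiplying the returned fraction by 100 gives the percentage form used in the table.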