Combining Residual Networks with LSTMs for Lipreading
About
We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
Themos Stafylakis, Georgios Tzimiropoulos • 2017
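
As a rough illustration of how the three stages named in the abstract (spatiotemporal convolution, residual network, bidirectional LSTM) could fit together, the sketch below traces tensor shapes through such a pipeline. It is a shape walkthrough, not the authors' implementation. Everything numeric except the 500 target words is an assumption made for illustration: the grayscale input, the 29-frame 112x112 clip, the kernel, stride, and padding values, the 64 front-end channels, the 256-dimensional per-frame feature, the LSTM hidden size, and the final pooling over time. The function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Standard output-length formula for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def layer3d(shape, out_channels, kernel, stride, padding):
    """Shape after a 3D layer; shape is (channels, frames, height, width)."""
    _, *dims = shape
    new_dims = [
        conv_out(d, k, s, p)
        for d, k, s, p in zip(dims, kernel, stride, padding)
    ]
    return (out_channels, *new_dims)


def trace_shapes(frames, size, num_words=500, feat_dim=256, hidden=256):
    """Trace shapes through an assumed front-end, ResNet, and BiLSTM back-end."""
    stages = {}

    # Assumed input: one grayscale channel per frame.
    x = (1, frames, size, size)
    stages["input"] = x

    # Spatiotemporal front-end (assumed 5x7x7 kernel, spatial stride 2).
    x = layer3d(x, 64, kernel=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3))
    stages["conv3d"] = x

    # Spatial max pooling (assumed 3x3 window, stride 2); time is untouched.
    x = layer3d(x, x[0], kernel=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1))
    stages["maxpool"] = x

    # Residual network applied to each frame independently, reducing every
    # frame to a single feature vector (assumed size feat_dim).
    seq = (x[1], feat_dim)
    stages["resnet_per_frame"] = seq

    # Bidirectional LSTM: forward and backward states are concatenated.
    seq = (seq[0], 2 * hidden)
    stages["bilstm"] = seq

    # One score per target word after pooling over time (assumed).
    stages["word_scores"] = (num_words,)
    return stages


# Hypothetical clip: 29 frames of 112x112 mouth crops.
for name, shape in trace_shapes(frames=29, size=112).items():
    print(name, shape)
```

The temporal dimension is preserved up to the recurrent back-end in this sketch, which is what lets a sequence model operate on one feature vector per frame.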
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Lip-reading | LRW-1000 (test) | Accuracy | 38.2 | 50 |
| Lip-reading Classification | LRW (test) | Accuracy | 83 | 38 |
| Lip-reading | LRW 1.0 (test) | Top-1 Accuracy | 83 | 37 |
| Lip-reading | LRW original (test) | Top-1 Accuracy | 83 | 14 |
| Word Recognition | LRW (test) | Correct Rate | 83 | 13 |
| Lip-reading | LRW Word-level (test) | Accuracy | 83.5 | 13 |
| Lip-reading Classification | LRW-1000 cropped mouth regions (test) | Top-1 Accuracy | 0.382 | 9 |
| Lip-reading | LRW | Accuracy | 83 | 6 |