Combining Residual Networks with LSTMs for Lipreading

About

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database with a 500-word target vocabulary, consisting of 1.28 sec video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state-of-the-art, without using information about word boundaries during training or testing.

Themos Stafylakis, Georgios Tzimiropoulos • 2017
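
To make the three-stage pipeline named in the abstract more concrete, here is a minimal shape-tracing sketch in plain Python. Only the 500-word vocabulary and the 1.28 sec clip length come from the abstract. Everything else is an illustrative assumption and is not confirmed by this page: the 25 fps frame rate, the 112x112 mouth crop, the kernel, stride and padding values, the 256-dimensional per-frame feature, and the 256-unit LSTM. The helper names `conv_out` and `trace_shapes` are hypothetical.

```python
from typing import Dict, Tuple

NUM_WORDS = 500          # vocabulary size stated in the abstract
CLIP_SECONDS = 1.28      # clip length stated in the abstract
ASSUMED_FPS = 25         # assumption: frame rate is not given on this page


def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(
    height: int = 112,       # assumed mouth-crop height
    width: int = 112,        # assumed mouth-crop width
    feat_dim: int = 256,     # assumed per-frame feature size after the residual net
    lstm_hidden: int = 256,  # assumed hidden units per LSTM direction
) -> Dict[str, Tuple[int, ...]]:
    """Trace tensor shapes (batch axis omitted) through the three stages."""
    frames = round(CLIP_SECONDS * ASSUMED_FPS)
    shapes: Dict[str, Tuple[int, ...]] = {"input": (frames, height, width)}

    # Stage 1: spatiotemporal (3D) convolution, assumed kernel 5x7x7,
    # stride 1x2x2, padding 2x3x3, so the time axis keeps its length.
    t = conv_out(frames, kernel=5, stride=1, padding=2)
    h = conv_out(height, kernel=7, stride=2, padding=3)
    w = conv_out(width, kernel=7, stride=2, padding=3)
    shapes["conv3d"] = (t, h, w)

    # Assumed spatial max pooling: kernel 3, stride 2, padding 1.
    h = conv_out(h, kernel=3, stride=2, padding=1)
    w = conv_out(w, kernel=3, stride=2, padding=1)
    shapes["pool"] = (t, h, w)

    # Stage 2: residual network applied to each time step, reducing every
    # frame to a single feature vector.
    shapes["resnet"] = (t, feat_dim)

    # Stage 3: bidirectional LSTM; forward and backward states are concatenated.
    shapes["bilstm"] = (t, 2 * lstm_hidden)

    # One score per target word for the whole clip.
    shapes["logits"] = (NUM_WORDS,)
    return shapes


if __name__ == "__main__":
    for stage, shape in trace_shapes().items():
        print(f"{stage:>7}: {shape}")
```

The sketch only tracks shapes. It contains no learned weights and says nothing about how the paper aggregates per-frame outputs into a single word prediction.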

Related benchmarks

Task | Dataset | Result | Rank
Lip-reading | LRW-1000 (test) | Accuracy 38.2 | 50
Lip-reading Classification | LRW (test) | Accuracy 83 | 38
Lip-reading | LRW 1.0 (test) | Top-1 Accuracy 83 | 37
Lip-reading | LRW original (test) | Top-1 Accuracy 83 | 14
Word Recognition | LRW (test) | Correct Rate 83 | 13
Lip-reading | LRW Word-level (test) | Accuracy 83.5 | 13
Lip-reading Classification | LRW-1000 cropped mouth regions (test) | Top-1 Accuracy 0.382 | 9
Lip-reading | LRW | Accuracy 83 | 6
