Combining Residual Networks with LSTMs for Lipreading

About

We propose an end-to-end deep learning architecture for word-level visual speech recognition. The system combines spatiotemporal convolutional, residual, and bidirectional Long Short-Term Memory networks. We train and evaluate it on the Lipreading In-The-Wild benchmark, a challenging database of 500 target words consisting of 1.28-second video excerpts from BBC TV broadcasts. The proposed network attains a word accuracy of 83.0%, a 6.8% absolute improvement over the current state of the art, without using information about word boundaries during training or testing.
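To put the reported improvement in context, the short sketch below works out the accuracy of the previous best system implied by the abstract, and the corresponding relative reduction in word error. It assumes both figures are percentages measured on the same test set; the function names are hypothetical and chosen only for illustration.

```python
def implied_baseline(accuracy: float, absolute_gain: float) -> float:
    """Accuracy of the prior system, given the new accuracy and the absolute gain."""
    return accuracy - absolute_gain


def relative_error_reduction(accuracy: float, baseline: float) -> float:
    """Fraction of the baseline's word errors removed by the new system."""
    old_error = 100.0 - baseline
    new_error = 100.0 - accuracy
    return (old_error - new_error) / old_error


# Figures taken from the abstract; treated as percentages (assumption).
accuracy = 83.0
absolute_gain = 6.8

baseline = implied_baseline(accuracy, absolute_gain)
reduction = relative_error_reduction(accuracy, baseline)

print(f"Implied previous accuracy: {baseline:.1f}%")   # 76.2%
print(f"Relative error reduction: {reduction:.1%}")    # 28.6%
```

Under these assumptions, a 6.8-point absolute gain corresponds to removing a little over a quarter of the previous system's word errors.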

Themos Stafylakis, Georgios Tzimiropoulos • 2017

Related benchmarks

Task	Dataset	Result
Lip-reading	LRW-1000 (test)	Accuracy 38.2	50
Lip-reading Classification	LRW (test)	Accuracy 83	38
Lip-reading	LRW 1.0 (test)	Top-1 Accuracy 83	37
Lip-reading	LRW original (test)	Top-1 Accuracy 83	14
Word Recognition	LRW (test)	Correct Rate 83	13
Lip-reading	LRW Word-level (test)	Accuracy 83.5	13
Lip-reading Classification	LRW-1000 cropped mouth regions (test)	Top-1 Accuracy 0.382	9
Lip-reading	LRW	Accuracy 83	6
