
Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition

About

Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively model linguistic rules in end-to-end deep networks remains a research challenge. In this paper, we argue that the limited capacity of language models comes from: 1) implicit language modeling; 2) unidirectional feature representation; and 3) language models with noisy input. Correspondingly, we propose ABINet, an autonomous, bidirectional and iterative network for scene text recognition. Firstly, "autonomous" means blocking gradient flow between the vision and language models to enforce explicit language modeling. Secondly, a novel bidirectional cloze network (BCN) is proposed as the language model, based on bidirectional feature representation. Thirdly, we propose iterative correction as the execution manner for the language model, which effectively alleviates the impact of noisy input. Additionally, based on the ensemble of iterative predictions, we propose a self-training method that can learn effectively from unlabeled images. Extensive experiments indicate that ABINet is superior on low-quality images and achieves state-of-the-art results on several mainstream benchmarks. Moreover, ABINet trained with ensemble self-training shows promising improvement toward human-level recognition. Code is available at https://github.com/FangShancheng/ABINet.

Shancheng Fang, Hongtao Xie, Yuxin Wang, Zhendong Mao, Yongdong Zhang • 2021
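
The iterative correction idea from the abstract can be illustrated with a toy sketch. This is not the ABINet implementation: the three-word `LEXICON` stands in for a learned language model, `cloze_distribution` is a hypothetical substitute for the BCN that guesses one character from the characters on both sides of it, the `4.0 ** matches` voting and the equal-weight averaging in `fuse` are assumptions chosen for illustration, and the vision probabilities are made up. Gradient blocking is not shown because nothing is trained here.

```python
from typing import Dict, List

Dist = Dict[str, float]

# Hypothetical toy vocabulary standing in for a learned language model.
LEXICON = ["house", "mouse", "cause"]


def cloze_distribution(text: str, pos: int, lexicon: List[str]) -> Dist:
    """Guess the character at `pos` from the characters on both sides of it.

    The character currently at `pos` is never inspected (a cloze task).
    """
    votes: Dist = {}
    for word in lexicon:
        if len(word) != len(text):
            continue
        matches = sum(1 for i, ch in enumerate(word) if i != pos and ch == text[i])
        # Assumed weighting: words agreeing with more context dominate the vote.
        votes[word[pos]] = votes.get(word[pos], 0.0) + 4.0 ** matches
    total = sum(votes.values())
    return {ch: v / total for ch, v in votes.items()} if total else {}


def fuse(vision: Dist, language: Dist) -> str:
    """Average both distributions and return the most probable character."""
    chars = sorted(set(vision) | set(language))
    return max(chars, key=lambda c: 0.5 * vision.get(c, 0.0) + 0.5 * language.get(c, 0.0))


def iterative_correction(vision_probs: List[Dist], lexicon: List[str],
                         iterations: int = 3) -> List[str]:
    """Return the vision-only reading followed by the text after each round."""
    text = "".join(max(sorted(d), key=d.get) for d in vision_probs)
    history = [text]
    for _ in range(iterations):
        # The vision output stays fixed; only the language model input changes.
        text = "".join(
            fuse(vision_probs[pos], cloze_distribution(text, pos, lexicon))
            for pos in range(len(text))
        )
        history.append(text)
    return history


# Made-up vision output for an image of "cause" with two ambiguous characters.
vision_probs = [
    {"e": 0.50, "c": 0.45, "h": 0.05},
    {"o": 0.55, "a": 0.45},
    {"u": 1.0},
    {"s": 1.0},
    {"e": 1.0},
]
history = iterative_correction(vision_probs, LEXICON)
print(history)  # ['eouse', 'couse', 'cause', 'cause']
```

In this toy run the first round fixes only the first character, because the second character is still judged against a noisy context; once the corrected text is fed back in, the second round fixes the remaining error and the prediction stabilizes.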

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Scene Text Recognition | SVT (test) | Word Accuracy | 98.2 | 289 |
| Scene Text Recognition | IIIT5K (test) | Word Accuracy | 98.6 | 244 |
| Scene Text Recognition | IC15 (test) | Word Accuracy | 90.5 | 210 |
| Scene Text Recognition | IC13 (test) | Word Accuracy | 98 | 207 |
| Scene Text Recognition | SVTP (test) | Word Accuracy | 94.1 | 153 |
| Scene Text Recognition | IIIT5K | Accuracy | 96.2 | 149 |
| Scene Text Recognition | IC13, IC15, IIIT, SVT, SVTP, CUTE80 Average of 6 benchmarks (test) | Average Accuracy | 96.01 | 105 |
| Scene Text Recognition | SVT 647 (test) | Accuracy | 97.8 | 101 |
| Scene Text Recognition | CUTE 288 samples (test) | Word Accuracy | 97.7 | 98 |
| Scene Text Recognition | CUTE | Accuracy | 94.1 | 92 |
Showing 10 of 114 rows
