
Joint Visual Semantic Reasoning: Multi-Stage Decoder for Text Recognition

About

Although text recognition has evolved significantly over the years, state-of-the-art (SOTA) models still struggle in in-the-wild scenarios due to complex backgrounds, varying fonts, uncontrolled illumination, distortions and other artefacts. This is because such models depend solely on visual information for text recognition and thus lack semantic reasoning capabilities. In this paper, we argue that semantic information plays a role complementary to visual information. More specifically, we exploit semantic information by proposing a multi-stage, multi-scale attentional decoder that performs joint visual-semantic reasoning. Our novelty lies in the intuition that, for text recognition, the prediction should be refined in a stage-wise manner. Our key contribution is therefore a stage-wise unrolling attentional decoder in which the non-differentiability introduced by discretely predicted character labels is bypassed to allow end-to-end training. While the first stage predicts using visual features, subsequent stages refine its prediction using joint visual-semantic information. Additionally, we introduce multi-scale 2D attention, along with dense and residual connections between stages, to handle varying character sizes, improve performance and speed up convergence during training. Experimental results show that our approach outperforms existing SOTA methods by a considerable margin.
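The stage-wise refinement described in the abstract can be illustrated with a toy sketch. The abstract does not say how the non-differentiability is bypassed, so the Gumbel-softmax relaxation below is an assumption, as are all names (`decode_stages`, `semantic_weights`, `embeddings`) and the very simple "semantic" step, which only averages soft character embeddings over positions. It shows the forward pass only and is not the authors' architecture.

```python
import math
import random


def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp((x - m) / tau) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, tau, rng):
    """Assumed relaxation: add Gumbel(0, 1) noise, then apply a soft
    (differentiable) softmax instead of a hard argmax."""
    noisy = []
    for x in logits:
        u = max(rng.random(), 1e-12)  # avoid log(0)
        noisy.append(x - math.log(-math.log(u)))
    return softmax(noisy, tau)


def soft_embedding(probs, embeddings):
    """Expected character embedding under probs; a smooth stand-in for
    looking up the embedding of the argmax character."""
    dim = len(embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, embeddings)) for d in range(dim)]


def decode_stages(visual_logits, embeddings, semantic_weights,
                  num_stages=3, tau=0.5, seed=0):
    """Toy multi-stage decoder.

    visual_logits:    T x C scores from the visual-only first stage.
    embeddings:       C x D character embeddings (hypothetical).
    semantic_weights: C x D projection from semantic context to class scores
                      (hypothetical).
    Returns the list of per-stage logits, stage 1 first.
    """
    rng = random.Random(seed)
    logits = visual_logits
    history = [logits]
    dim = len(embeddings[0])
    for _ in range(num_stages - 1):
        # Soft character predictions from the previous stage.
        soft = [soft_embedding(gumbel_softmax(pos, tau, rng), embeddings)
                for pos in logits]
        # Toy semantic context: mean soft embedding over all positions.
        context = [sum(vec[d] for vec in soft) / len(soft) for d in range(dim)]
        sem_scores = [sum(w * c for w, c in zip(row, context))
                      for row in semantic_weights]
        # Residual connection back to the visual scores.
        logits = [[v + s for v, s in zip(vis, sem_scores)]
                  for vis in visual_logits]
        history.append(logits)
    return history
```

Because each later stage consumes a probability-weighted embedding rather than a discrete label, every operation in the loop is smooth, which is the property an end-to-end trainable multi-stage decoder needs.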

Ayan Kumar Bhunia, Aneeshan Sain, Amandeep Kumar, Shuvozit Ghose, Pinaki Nath Chowdhury, Yi-Zhe Song • 2021

Related benchmarks

Task                    Dataset                      Result              Rank
Scene Text Recognition  SVT (test)                   Word Accuracy 92.2  289
Scene Text Recognition  IIIT5K (test)                Word Accuracy 95.2  244
Scene Text Recognition  IC15 (test)                  Word Accuracy 84    210
Scene Text Recognition  IC13 (test)                  Word Accuracy 95.5  207
Scene Text Recognition  SVTP (test)                  Word Accuracy 85.7  153
Scene Text Recognition  SVT 647 (test)               Accuracy 92.2       101
Scene Text Recognition  CUTE 288 samples (test)      Word Accuracy 89.7  98
Scene Text Recognition  CUTE (test)                  Accuracy 89.7       59
Scene Text Recognition  IIIT5K 3,000 samples (test)  Word Accuracy 95.2  59
Scene Text Recognition  SVTP 645 (test)              Accuracy 85.7       54
Showing 10 of 26 rows
