SCATTER: Selective Context Attentional Scene Text Recognizer
About
Scene Text Recognition (STR), the task of recognizing text against complex image backgrounds, is an active area of research. Current state-of-the-art (SOTA) methods still struggle to recognize text written in arbitrary shapes. In this paper, we introduce a novel architecture for STR, named Selective Context ATtentional Text Recognizer (SCATTER). SCATTER uses a stacked block architecture with intermediate supervision during training, which makes it possible to train a deep BiLSTM encoder successfully and thus improves the encoding of contextual dependencies. Decoding is done using a two-step 1D attention mechanism. The first attention step re-weights visual features from a CNN backbone together with contextual features computed by a BiLSTM layer. The second attention step, similar to previous papers, treats the features as a sequence and attends to the intra-sequence relationships. Experiments show that the proposed approach surpasses SOTA performance on irregular text recognition benchmarks by 3.7% on average.
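The two-step decoding described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the sigmoid gate in step one, the dot-product scoring in step two, and every name here (`selective_reweight`, `attend_1d`, `two_step_attention`, `weight`, `bias`, `query`) are assumptions chosen for illustration. The abstract only states that visual and contextual features are re-weighted together and then attended to as a sequence.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def selective_reweight(features, weight, bias):
    """Step 1 (assumed form): gate every channel of every time step.

    features: T x D list of lists, weight: D x D, bias: length D.
    """
    out = []
    for d in features:
        gate = [
            sigmoid(sum(w * x for w, x in zip(row, d)) + b)
            for row, b in zip(weight, bias)
        ]
        out.append([g * x for g, x in zip(gate, d)])
    return out


def attend_1d(features, query):
    """Step 2 (simplified): dot-product attention over the sequence axis."""
    scores = [sum(q * x for q, x in zip(query, d)) for d in features]
    alpha = softmax(scores)
    dim = len(features[0])
    glimpse = [sum(a * d[i] for a, d in zip(alpha, features)) for i in range(dim)]
    return alpha, glimpse


def two_step_attention(visual, contextual, weight, bias, query):
    # Concatenate CNN (visual) and BiLSTM (contextual) features per time step.
    combined = [v + h for v, h in zip(visual, contextual)]
    reweighted = selective_reweight(combined, weight, bias)
    return attend_1d(reweighted, query)


# Toy inputs: 5 time steps, 3 visual and 2 contextual channels.
rng = random.Random(0)
T, DV, DH = 5, 3, 2
D = DV + DH
visual = [[rng.uniform(-1, 1) for _ in range(DV)] for _ in range(T)]
contextual = [[rng.uniform(-1, 1) for _ in range(DH)] for _ in range(T)]
weight = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
bias = [0.0] * D
query = [rng.uniform(-1, 1) for _ in range(D)]

alpha, glimpse = two_step_attention(visual, contextual, weight, bias, query)
print(len(alpha), len(glimpse))
```

In a real recognizer the query would be the decoder's hidden state at each output character, and the weights would be learned. Here they are random values that only exercise the data flow.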
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Text Recognition | SVT (test) | Word Accuracy | 89.2 | 289 |
| Scene Text Recognition | IIIT5K (test) | Word Accuracy | 92.9 | 244 |
| Scene Text Recognition | SVTP (test) | Word Accuracy | 84.5 | 153 |
| Scene Text Recognition | IIIT5K | Accuracy | 93.7 | 149 |
| Scene Text Recognition | CUTE | Accuracy | 87.5 | 92 |
| Scene Text Recognition | CUTE80 (test) | Accuracy | 0.851 | 87 |
| Scene Text Recognition | SVT | Accuracy | 92.7 | 67 |
| Scene Text Recognition | IC03 | Accuracy | 96.3 | 67 |
| Scene Text Recognition | IC13 | Accuracy | 94.7 | 66 |
| Scene Text Recognition | IC 2013 (test) | Accuracy | 93.8 | 51 |