Primitive Representation Learning for Scene Text Recognition
About
Scene text recognition is a challenging task due to diverse variations of text instances in natural scene images. Conventional methods based on CNN-RNN-CTC or encoder-decoder with attention mechanism may not fully investigate stable and efficient feature representations for multi-oriented scene texts. In this paper, we propose a primitive representation learning method that aims to exploit intrinsic representations of scene text images. We model elements in feature maps as the nodes of an undirected graph. A pooling aggregator and a weighted aggregator are proposed to learn primitive representations, which are transformed into high-level visual text representations by graph convolutional networks. A Primitive REpresentation learning Network (PREN) is constructed to use the visual text representations for parallel decoding. Furthermore, by integrating visual text representations into an encoder-decoder model with the 2D attention mechanism, we propose a framework called PREN2D to alleviate the misalignment problem in attention-based methods. Experimental results on both English and Chinese scene text recognition tasks demonstrate that PREN keeps a balance between accuracy and efficiency, while PREN2D achieves state-of-the-art performance.
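The aggregation idea in the abstract can be illustrated with a small NumPy sketch. It is a rough illustration under stated assumptions, not the authors' implementation. The abstract does not give the aggregator formulas, so the function names, the shapes, the softmax weighting, the per-primitive linear maps, the summation of the two aggregator outputs, and the mixing matrix in `graph_conv` are all hypothetical choices made for this example.

```python
import numpy as np

def pooling_aggregator(features, proj_weights):
    """Assumed form: project node features once per primitive, then mean-pool over nodes.

    features:     (d, N) array, one d-dimensional feature per feature-map element (node).
    proj_weights: (n, d, d) array of hypothetical per-primitive linear maps.
    Returns an (n, d) array of primitive representations.
    """
    projected = np.einsum("nij,jk->nik", proj_weights, features)  # (n, d, N)
    return projected.mean(axis=2)

def weighted_aggregator(features, score_weights):
    """Assumed form: each primitive is a softmax-weighted sum over all nodes.

    score_weights: (n, d) array of hypothetical scoring vectors, one per primitive.
    Returns an (n, d) array of primitive representations.
    """
    scores = score_weights @ features                     # (n, N)
    scores = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)         # each row sums to 1
    return weights @ features.T                           # (n, d)

def graph_conv(primitives, mixing, weight):
    """One GCN-style layer, ReLU(M @ X @ W), mapping n primitives to T output positions.

    mixing: (T, n) hypothetical coefficient matrix; weight: (d, d_out).
    """
    return np.maximum(mixing @ primitives @ weight, 0.0)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
d, H, W, n, T = 8, 4, 16, 5, 25
features = rng.standard_normal((d, H * W))  # flattened feature map: H*W graph nodes

pooled = pooling_aggregator(features, rng.standard_normal((n, d, d)))
weighted = weighted_aggregator(features, rng.standard_normal((n, d)))
# Summing the two aggregator outputs is an assumption made for this sketch.
visual = graph_conv(pooled + weighted, rng.standard_normal((T, n)), rng.standard_normal((d, d)))
print(pooled.shape, weighted.shape, visual.shape)
```

In this sketch, each of the `T` rows of `visual` stands in for one visual text representation. A parallel decoder in the spirit of PREN would classify every row independently, rather than one character at a time.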
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Text Recognition | SVT (test) | Word Accuracy | 94 | 289 |
| Scene Text Recognition | IIIT5K (test) | Word Accuracy | 95.6 | 244 |
| Scene Text Recognition | IC15 (test) | Word Accuracy | 83 | 210 |
| Scene Text Recognition | IC13 (test) | Word Accuracy | 96.4 | 207 |
| Scene Text Recognition | SVTP (test) | Word Accuracy | 87.6 | 153 |
| Scene Text Recognition | IIIT5K | Accuracy | 95.6 | 149 |
| Scene Text Recognition | SVT 647 (test) | Accuracy | 94 | 101 |
| Scene Text Recognition | CUTE 288 samples (test) | Word Accuracy | 91.7 | 98 |
| Scene Text Recognition | CUTE | Accuracy | 91.7 | 92 |
| Scene Text Recognition | CUTE80 (test) | Accuracy | 0.917 | 87 |