Deep Structured Output Learning for Unconstrained Text Recognition
About
We develop a representation suitable for the unconstrained recognition of words in natural images: the general case of no fixed lexicon and unknown length. To this end we propose a convolutional neural network (CNN) based architecture which incorporates a Conditional Random Field (CRF) graphical model, taking the whole word image as a single input. The unaries of the CRF are provided by a CNN that predicts characters at each position of the output, while higher order terms are provided by another CNN that detects the presence of N-grams. We show that this entire model (CRF, character predictor, N-gram predictor) can be jointly optimised by back-propagating the structured output loss, essentially requiring the system to perform multi-task learning. Training uses purely synthetically generated data. The resulting model is more accurate on standard real-world text recognition benchmarks than character prediction alone, setting a benchmark for systems that have not been trained on a particular lexicon. In addition, our model achieves state-of-the-art accuracy in lexicon-constrained scenarios, without being specifically modelled for constrained recognition. To test the generalisation of our model, we also perform experiments with random alpha-numeric strings to evaluate the method when no visual language model is applicable.
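To make the structure of the CRF concrete, the sketch below scores a candidate word as the sum of per-position character scores (the unaries) and the scores of every N-gram the word contains (the higher-order terms). The abstract does not give the exact scoring function or the inference procedure, so everything here is an assumption for illustration: the names `ngrams`, `path_score` and `best_word` are hypothetical, the score tables are hand-written toy numbers standing in for CNN outputs, and exhaustive search stands in for whatever inference the model actually uses.

```python
from itertools import product


def ngrams(word, max_n):
    """Return every contiguous substring of length 2..max_n found in word."""
    return [
        word[i:i + n]
        for n in range(2, max_n + 1)
        for i in range(len(word) - n + 1)
    ]


def path_score(word, unary, ngram_score, max_n=2):
    """Sum of per-position character scores plus scores of contained N-grams.

    unary[i][c] is an assumed score for character c at position i;
    ngram_score[g] is an assumed score for N-gram g. Missing entries count as 0.
    """
    char_term = sum(unary[i].get(c, 0.0) for i, c in enumerate(word))
    ngram_term = sum(ngram_score.get(g, 0.0) for g in ngrams(word, max_n))
    return char_term + ngram_term


def best_word(unary, ngram_score, alphabet, max_n=2):
    """Exhaustively search all strings of length len(unary); toy sizes only."""
    candidates = ("".join(chars) for chars in product(alphabet, repeat=len(unary)))
    return max(candidates, key=lambda w: path_score(w, unary, ngram_score, max_n))


# Toy scores (made up): the character predictor slightly prefers "e" over "c"
# in the first position, but the N-gram predictor supports the bigram "ca".
toy_unary = [
    {"c": 0.9, "e": 1.0},
    {"a": 1.0},
    {"t": 1.0},
]
toy_ngrams = {"ca": 0.5, "at": 0.3}

chars_only = best_word(toy_unary, {}, "acet")
with_ngrams = best_word(toy_unary, toy_ngrams, "acet")
print(chars_only, with_ngrams)
```

With these toy numbers, character scores alone select "eat", while adding the N-gram terms selects "cat". This mirrors the abstract's claim that the joint model improves on character prediction alone, though only as an illustration with invented values.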
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Scene Text Recognition | SVT (test) | Word Accuracy 93.2 | 289 |
| Scene Text Recognition | IIIT5K (test) | Word Accuracy 95.5 | 244 |
| Scene Text Recognition | IIIT5K | Accuracy 95.5 | 149 |
| Text Recognition | Street View Text (SVT) | Accuracy 93.2 | 80 |
| Scene Text Recognition | IC03 | Accuracy 97.8 | 67 |
| Scene Text Recognition | SVT | -- | 67 |
| Scene Text Recognition | IC03 (test) | Accuracy 93.1 | 63 |
| Scene Text Recognition | IC 2003 (test) | Word Accuracy 97.8 | 38 |
| Scene Text Recognition | IIIT5K | Accuracy (50 Lexicon) 95.5 | 28 |
| Scene Text Recognition | ICDAR13 (test) | Accuracy 90.8 | 24 |