DTrOCR: Decoder-only Transformer for Optical Character Recognition
About
Typical text recognition methods rely on an encoder-decoder structure, in which the encoder extracts features from an image, and the decoder produces recognized text from these features. In this study, we propose a simpler and more effective method for text recognition, known as the Decoder-only Transformer for Optical Character Recognition (DTrOCR). This method uses a decoder-only Transformer to take advantage of a generative language model that is pre-trained on a large corpus. We examined whether a generative language model that has been successful in natural language processing can also be effective for text recognition in computer vision. Our experiments demonstrated that DTrOCR outperforms current state-of-the-art methods by a large margin in the recognition of printed, handwritten, and scene text in both English and Chinese.
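As a rough illustration of what "decoder-only" means here, the sketch below lays out a single input sequence in which image patches come first, followed by text tokens, all under one causal attention mask, with no separate encoder. The abstract does not specify these details, so the patch size, the separator token `sep_id`, and the helper names `patchify`, `causal_mask`, and `build_sequence` are all assumptions made for illustration, not the authors' implementation.

```python
def patchify(image, patch_h, patch_w):
    """Split a 2D grayscale image (list of rows) into flattened patches, row-major."""
    height, width = len(image), len(image[0])
    if height % patch_h or width % patch_w:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for top in range(0, height, patch_h):
        for left in range(0, width, patch_w):
            patches.append([
                image[r][c]
                for r in range(top, top + patch_h)
                for c in range(left, left + patch_w)
            ])
    return patches


def causal_mask(length):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def build_sequence(image, text_token_ids, patch_h, patch_w, sep_id):
    """Assumed layout: [patch_1 .. patch_n, SEP, token_1 .. token_m]."""
    patches = patchify(image, patch_h, patch_w)
    sequence = [("patch", p) for p in patches]
    sequence.append(("token", sep_id))  # hypothetical image/text separator
    sequence.extend(("token", t) for t in text_token_ids)
    return sequence, causal_mask(len(sequence))


# Toy example: a 4x8 image split into two 4x4 patches, plus three text tokens.
toy_image = [[0] * 8 for _ in range(4)]
sequence, mask = build_sequence(toy_image, [5, 6, 7], 4, 4, sep_id=0)
```

In a real model each patch would be projected to an embedding vector and each token ID looked up in an embedding table before entering the Transformer; the sketch only shows the sequence layout and the mask that lets every text position attend to all image patches and to earlier text.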
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Scene Text Recognition | SVT (test) | Word Accuracy | 98.9 | 289 |
| Scene Text Recognition | IC15 (test) | Word Accuracy | 93.5 | 210 |
| Scene Text Recognition | IC13 (test) | Word Accuracy | 99.4 | 207 |
| Scene Text Recognition | CUTE 288 samples (test) | Word Accuracy | 99.1 | 98 |
| Scene Text Recognition | IIIT5K 3,000 samples (test) | Word Accuracy | 99.6 | 59 |
| Scene Text Recognition | SVTP 645 samples (test) | Word Accuracy | 98.6 | 48 |
| Text Recognition | Chinese text recognition benchmark | Scene Acc | 87.4 | 33 |
| Handwriting Recognition | IAM | CER | 2.38 | 32 |
| Handwritten text recognition | IAM-A (test) | CER (%) | 2.38 | 24 |
| Handwritten text recognition | IAM Aachen (test) | CER | 2.38 | 23 |
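For readers unfamiliar with the two metrics in the table, the following is a minimal sketch of how word accuracy and character error rate (CER) are commonly computed. It assumes exact string matching and a plain Levenshtein distance; individual benchmarks often apply their own normalization (for example case folding or stripping punctuation), which this page does not specify, so the function names and conventions here are illustrative only.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references, hypotheses):
    """Character error rate in percent: total edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return 100.0 * total_edits / total_chars


def word_accuracy(references, hypotheses):
    """Percentage of samples whose prediction matches the reference exactly."""
    correct = sum(r == h for r, h in zip(references, hypotheses))
    return 100.0 * correct / len(references)


# Toy data, not taken from any benchmark.
refs = ["hello", "world"]
hyps = ["hallo", "world"]
toy_cer = cer(refs, hyps)
toy_acc = word_accuracy(refs, hyps)
```

Higher is better for word accuracy, while lower is better for CER, which is why the handwriting rows report small values.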