TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models

About

Text recognition is a long-standing research problem for document digitalization. Existing approaches are usually built on CNNs for image understanding and RNNs for character-level text generation. In addition, a separate language model is usually needed as a post-processing step to improve overall accuracy. In this paper, we propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR, which leverages the Transformer architecture for both image understanding and wordpiece-level text generation. The TrOCR model is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets. Experiments show that the TrOCR model outperforms the current state-of-the-art models on printed, handwritten, and scene text recognition tasks. The TrOCR models and code are publicly available at https://aka.ms/trocr.

Minghao Li, Tengchao Lv, Jingye Chen, Lei Cui, Yijuan Lu, Dinei Florencio, Cha Zhang, Zhoujun Li, Furu Wei • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Scene Text Recognition | SVT (test) | Word Accuracy 96.1 | 289 |
| Scene Text Recognition | IIIT5K (test) | Word Accuracy 94.1 | 244 |
| Scene Text Recognition | IC15 (test) | Word Accuracy 88.1 | 210 |
| Scene Text Recognition | IC13 (test) | Word Accuracy 98.3 | 207 |
| Scene Text Recognition | SVTP (test) | Word Accuracy 93 | 153 |
| Scene Text Recognition | IC13, IC15, IIIT, SVT, SVTP, CUTE80, average of 6 benchmarks (test) | Average Accuracy 93.23 | 105 |
| Handwritten Text Recognition | IAM (test) | CER 3.4 | 102 |
| Scene Text Recognition | SVT 647 (test) | Accuracy 96.1 | 101 |
| Scene Text Recognition | CUTE 288 samples (test) | Word Accuracy 95.1 | 98 |
| Scene Text Recognition | CUTE | Accuracy 89.6 | 92 |
Showing 10 of 33 rows
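
The benchmark rows above report two kinds of metric: word accuracy for scene text and character error rate (CER) for handwritten text. As a rough illustration of how such numbers are commonly computed, here is a minimal sketch. It assumes exact string matching for word accuracy and plain Levenshtein distance for CER, both expressed as percentages. The function names are hypothetical, and the normalization each benchmark actually applies (case folding, punctuation filtering) is not stated on this page and is not reproduced here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions each cost 1."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Character error rate in percent: total edits / total reference characters."""
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return 100.0 * total_edits / total_chars


def word_accuracy(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Percentage of predictions that match their reference exactly (assumed rule)."""
    if not refs:
        raise ValueError("no references given")
    correct = sum(r == h for r, h in zip(refs, hyps))
    return 100.0 * correct / len(refs)
```

Under these assumptions, lower is better for CER and higher is better for word accuracy, which matches how the two metrics are usually read in the rows above.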
