Levenshtein OCR

About

A novel scene text recognizer based on a Vision-Language Transformer (VLT) is presented. Inspired by the Levenshtein Transformer from NLP, the proposed method (named Levenshtein OCR, or LevOCR for short) explores an alternative way of automatically transcribing textual content from cropped natural images. Specifically, we cast scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer, where it interacts and fuses with the visual features so that the prediction progressively approximates the ground truth. The refinement is accomplished via two basic character-level operations, deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change, and good interpretability. Quantitative experiments demonstrate that LevOCR achieves state-of-the-art performance on standard benchmarks, and qualitative analyses verify the effectiveness and advantages of the proposed algorithm. Code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR.

Cheng Da, Peng Wang, Cong Yao • 2022
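
The refinement step described in the abstract, a deletion pass followed by an insertion pass over a character sequence, can be illustrated with a toy sketch. This is not the LevOCR implementation: in the actual model the edit decisions would be predicted by the cross-modal transformer, whereas here they are hard-coded inputs. The function names (`apply_deletion`, `apply_insertion`, `refine_once`) and the representation of insertions as per-gap counts plus a list of new characters are assumptions made for illustration only.

```python
from itertools import islice
from typing import List, Sequence


def apply_deletion(chars: Sequence[str], delete_mask: Sequence[bool]) -> List[str]:
    """Drop every character whose mask entry is True (hypothetical helper)."""
    if len(chars) != len(delete_mask):
        raise ValueError("delete_mask must have one entry per character")
    return [c for c, drop in zip(chars, delete_mask) if not drop]


def apply_insertion(
    chars: Sequence[str], slot_counts: Sequence[int], new_chars: Sequence[str]
) -> List[str]:
    """Insert new characters into the gaps of a sequence (hypothetical helper).

    slot_counts[i] is the number of characters inserted before chars[i];
    the final entry is the number appended after the last character.
    new_chars supplies the inserted characters in left-to-right order.
    """
    if len(slot_counts) != len(chars) + 1:
        raise ValueError("slot_counts needs len(chars) + 1 entries")
    if sum(slot_counts) != len(new_chars):
        raise ValueError("new_chars must match the total number of slots")
    out: List[str] = []
    supply = iter(new_chars)
    for i, n in enumerate(slot_counts):
        out.extend(islice(supply, n))  # fill the gap before position i
        if i < len(chars):
            out.append(chars[i])
    return out


def refine_once(
    chars: Sequence[str],
    delete_mask: Sequence[bool],
    slot_counts: Sequence[int],
    new_chars: Sequence[str],
) -> List[str]:
    """One refinement iteration: deletion first, then insertion.

    slot_counts refers to the sequence that remains after deletion.
    """
    kept = apply_deletion(chars, delete_mask)
    return apply_insertion(kept, slot_counts, new_chars)


# Toy example: an initial prediction "heilo" is refined towards "hello".
initial = list("heilo")
refined = refine_once(
    initial,
    delete_mask=[False, False, True, False, False],  # delete the spurious "i"
    slot_counts=[0, 0, 0, 1, 0],                     # one gap before "o"
    new_chars=["l"],                                 # fill it with "l"
)
print("".join(refined))
```

Because every deletion decision and every gap is handled independently of the others, all positions can be edited in a single pass, and the sequence length changes freely between iterations. This mirrors the parallel decoding and dynamic length change mentioned in the abstract.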

Related benchmarks

Task	Dataset	Result
Scene Text Recognition	CUTE80	Accuracy 91.7	59
Scene Text Recognition	IIIT5K 3000 (test)	Accuracy 96.6	51
Scene Text Recognition	ICDAR 2015	Accuracy (No Lexicon) 84	41
Scene Text Recognition	SVT Perspective	Accuracy 88.1	37
Text Recognition	IIIT, SVT, IC13, IC15, SVTP, CT	IIIT Acc 96.6	37
Scene Text Recognition	ICDAR 2013	Accuracy 95.9	33
Scene Text Recognition	SVT 647 images	Accuracy 92.9	33
Scene Text Recognition	IIIT5K-Words (3000)	Accuracy 95.2	22
Scene Text Recognition	Street View Text 647	Accuracy 90.6	22
Scene Text Recognition	SVT Perspective (645)	Accuracy 84.2	22

Showing 10 of 13 rows
