Levenshtein OCR

About

A novel scene text recognizer based on a Vision-Language Transformer (VLT) is presented. Inspired by the Levenshtein Transformer from NLP, the proposed method (named Levenshtein OCR, or LevOCR for short) explores an alternative way of automatically transcribing textual content from cropped natural images. Specifically, we cast scene text recognition as an iterative sequence refinement process. The initial prediction sequence produced by a pure vision model is encoded and fed into a cross-modal transformer, where it interacts and fuses with the visual features to progressively approximate the ground truth. The refinement is accomplished via two basic character-level operations, deletion and insertion, which are learned with imitation learning and allow for parallel decoding, dynamic length change, and good interpretability. Quantitative experiments demonstrate that LevOCR achieves state-of-the-art performance on standard benchmarks, and qualitative analyses verify the effectiveness and advantages of the proposed algorithm. Code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery/tree/main/OCR/LevOCR.
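The refinement loop described above can be illustrated with a toy sketch. This is not the LevOCR implementation: the learned deletion and insertion models, which condition on visual features through a cross-modal transformer, are replaced here by hypothetical oracle policies that peek at the ground-truth string, loosely mimicking the expert an imitation-learning setup would learn from. All function names (`apply_deletion`, `apply_insertion`, `make_oracle_policies`, `refine`) are assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Tuple

# Hypothetical policy signatures (illustrative only, not LevOCR's API).
DeletePolicy = Callable[[str], List[bool]]  # one delete flag per character
InsertPolicy = Callable[[str], List[str]]   # one string per slot (len + 1 slots)


def apply_deletion(seq: str, delete_mask: List[bool]) -> str:
    """Drop every character whose flag is True; all positions are decided at once."""
    return "".join(ch for ch, drop in zip(seq, delete_mask) if not drop)


def apply_insertion(seq: str, inserts: List[str]) -> str:
    """Place inserts[i] before seq[i], and inserts[len(seq)] at the end."""
    out = []
    for i, ch in enumerate(seq):
        out.append(inserts[i])
        out.append(ch)
    out.append(inserts[len(seq)])
    return "".join(out)


def make_oracle_policies(target: str) -> Tuple[DeletePolicy, InsertPolicy]:
    """Build stand-in policies that use the ground truth (assumption: a real
    system would predict these decisions from image and text features)."""

    def delete_policy(seq: str) -> List[bool]:
        # Keep only characters that lie in a block matched against the target.
        mask = [True] * len(seq)
        matcher = SequenceMatcher(None, seq, target, autojunk=False)
        for block in matcher.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                mask[i] = False
        return mask

    def insert_policy(seq: str) -> List[str]:
        # Assumes seq is a subsequence of target, which holds after the
        # oracle deletion step; align greedily and fill in the gaps.
        inserts, j = [], 0
        for ch in seq:
            k = target.index(ch, j)
            inserts.append(target[j:k])
            j = k + 1
        inserts.append(target[j:])
        return inserts

    return delete_policy, insert_policy


def refine(initial: str, delete_policy: DeletePolicy,
           insert_policy: InsertPolicy, max_iters: int = 3) -> str:
    """Alternate deletion and insertion until the sequence stops changing."""
    seq = initial
    for _ in range(max_iters):
        after_delete = apply_deletion(seq, delete_policy(seq))
        new_seq = apply_insertion(after_delete, insert_policy(after_delete))
        if new_seq == seq:
            break
        seq = new_seq
    return seq


delete_policy, insert_policy = make_oracle_policies("hello")
print(refine("hcllo", delete_policy, insert_policy))  # hello
```

In this example the noisy initial prediction "hcllo" loses the spurious "c" in the deletion step and gains the missing "e" in the insertion step, so the sequence length can change between iterations and each edit is easy to inspect.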

Cheng Da, Peng Wang, Cong Yao • 2022

Related benchmarks

Task                    Dataset                          Metric                 Result  Rank
Scene Text Recognition  IIIT5K 3000 (test)               Accuracy               96.6    51
Scene Text Recognition  CUTE80                           Accuracy               91.7    47
Scene Text Recognition  SVT Perspective                  Accuracy               88.1    37
Text Recognition        IIIT, SVT, IC13, IC15, SVTP, CT  IIIT Acc               96.6    37
Scene Text Recognition  ICDAR 2015                       Accuracy (No Lexicon)  84      35
Scene Text Recognition  SVT 647 images                   Accuracy               92.9    33
Scene Text Recognition  ICDAR 2013                       Accuracy               95.9    27
Scene Text Recognition  IIIT5K-Words (3000)              Accuracy               95.2    22
Scene Text Recognition  Street View Text 647             Accuracy               90.6    22
Scene Text Recognition  SVT Perspective (645)            Accuracy               84.2    22
Showing 10 of 13 rows
