Revise: A Framework for Revising OCRed text in Practical Information Systems with Data Contamination Strategy

About

Recent advances in Large Language Models (LLMs) have significantly improved the field of Document AI, demonstrating remarkable performance on document understanding tasks such as question answering. However, existing approaches primarily focus on solving specific tasks, lacking the capability to structurally organize and manage document information. To address this limitation, we propose Revise, a framework that systematically corrects errors introduced by OCR at the character, word, and structural levels. Specifically, Revise employs a comprehensive hierarchical taxonomy of common OCR errors and a synthetic data generation strategy that realistically simulates such errors to train an effective correction model. Experimental results demonstrate that Revise effectively corrects OCR outputs, enabling more structured representation and systematic management of document contents. Consequently, our method significantly enhances downstream performance in document retrieval and question answering tasks, highlighting the potential to overcome the structural management limitations of existing Document AI frameworks.
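As a rough illustration of the synthetic data generation idea described above, the following minimal sketch injects noise into clean text at the character, word, and structural levels to produce (noisy, clean) training pairs. It is not the authors' implementation: the confusion table, the function names (`corrupt_chars`, `corrupt_words`, `corrupt_structure`, `make_pair`), and the single shared `rate` parameter are all hypothetical, and the three error types shown are simple stand-ins rather than the paper's taxonomy.

```python
import random

# Hypothetical glyph confusions, for illustration only (not the paper's taxonomy).
CHAR_CONFUSIONS = {"l": "1", "O": "0", "m": "rn", "S": "5"}


def corrupt_chars(text, rate, rng):
    """Character level: replace a confusable glyph with probability `rate`."""
    return "".join(
        CHAR_CONFUSIONS[c] if c in CHAR_CONFUSIONS and rng.random() < rate else c
        for c in text
    )


def corrupt_words(text, rate, rng):
    """Word level: merge adjacent words by dropping the space with probability `rate`."""
    words = text.split(" ")
    out = [words[0]]
    for word in words[1:]:
        if rng.random() < rate:
            out[-1] += word  # segmentation error: two words fused into one
        else:
            out.append(word)
    return " ".join(out)


def corrupt_structure(lines, rate, rng):
    """Structural level: swap adjacent lines to mimic reading-order errors."""
    lines = list(lines)
    i = 0
    while i < len(lines) - 1:
        if rng.random() < rate:
            lines[i], lines[i + 1] = lines[i + 1], lines[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return lines


def make_pair(clean_lines, rate=0.1, seed=0):
    """Return a (noisy, clean) text pair for training a correction model."""
    rng = random.Random(seed)
    noisy = corrupt_structure(clean_lines, rate, rng)
    noisy = [corrupt_chars(corrupt_words(line, rate, rng), rate, rng) for line in noisy]
    return "\n".join(noisy), "\n".join(clean_lines)
```

Calling `make_pair(["Invoice total", "Signed by Sam"], rate=0.3)` yields a corrupted string alongside the untouched original. In practice each level would likely have its own error rate and a much richer set of error types.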

Gyuho Shim, Seongtae Hong, Heuiseok Lim • 2026

Related benchmarks

Task                    Dataset     Result             Rank
Document Retrieval      VisualMRC   Recall@1 60.78     32
Document Retrieval      DUDE        Recall@1 25.17     32
Similarity Assessment   DocVQA      BERTScore 51.37    8
Similarity Assessment   CORD        BERTScore 54.43    8
Similarity Assessment   FUNSD       BERTScore 0.5647   8
OCR error correction    VisualMRC   Wins 94            7
OCR error correction    DUDE        Win Rate 92        7
Question Answering      VisualMRC   CIDEr 329.2        4
Question Answering      CORD        F1 Score 45        4