PILOT: A Promptable Interleaved Layout-aware OCR Transformer

About

Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.
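
To make the idea of quantized absolute-coordinate tokens concrete, here is a minimal Python sketch of how a line's bounding box could be mapped onto a 10 px grid and interleaved with its text tokens. Only the 10 px grid step comes from the abstract. The token names (`<x_12>`, `<y_4>`), the floor rounding, the box-before-text ordering, and the use of whole words in place of real subword pieces are illustrative assumptions, not PILOT's actual vocabulary or tokenizer.

```python
GRID_PX = 10  # grid step stated in the abstract


def quantize(value_px: float, grid_px: int = GRID_PX) -> int:
    """Map an absolute pixel coordinate to the index of its grid cell (floor)."""
    return int(value_px // grid_px)


def box_to_tokens(box, grid_px: int = GRID_PX):
    """Turn an (x0, y0, x1, y1) pixel box into coordinate tokens.

    The "<x_N>" / "<y_N>" naming is hypothetical, chosen for readability.
    """
    x0, y0, x1, y1 = box
    return [
        f"<x_{quantize(x0, grid_px)}>",
        f"<y_{quantize(y0, grid_px)}>",
        f"<x_{quantize(x1, grid_px)}>",
        f"<y_{quantize(y1, grid_px)}>",
    ]


def interleave(lines, grid_px: int = GRID_PX):
    """Build one token stream: each line's box tokens, then its text pieces.

    `lines` is a list of (box, pieces) pairs; the pieces stand in for the
    subword tokens a real tokenizer would produce.
    """
    stream = []
    for box, pieces in lines:
        stream.extend(box_to_tokens(box, grid_px))
        stream.extend(pieces)
    return stream


if __name__ == "__main__":
    page = [
        ((123, 47, 518, 79), ["Total", "due"]),
        ((123, 92, 301, 118), ["42.00"]),
    ]
    print(interleave(page))
```

With a 10 px step, a 1000 px page side needs only 100 coordinate symbols, which keeps the added vocabulary small at the cost of up to 10 px of localization error per coordinate.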

Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Handwritten text recognition | RIMES | Character Error Rate (CER): 5.05 | 26 |
| Optical Character Recognition | IAM HW, EN | CER: 4.31 | 17 |
| Optical Character Recognition | SROIE P, EN | F1 Score: 93.77 | 15 |
| Text Detection | SROIE | Precision: 95.67 | 7 |
| Optical Character Recognition | MAURDOR French English handwritten and printed full (test) | CER: 5.23 | 6 |
| Query-by-String Word Spotting | IAM (test) | mAP@50% IoU: 88.9 | 5 |
| Region-level OCR | English documents | Edit Distance: 3.8 | 4 |
| Text Detection | RIMES 2009 | Precision: 96.86 | 4 |
| Optical Character Recognition | MAURDOR C3 | CER: 5.86 | 2 |
| Text Detection | MAURDOR French English handwritten and printed full (test) | F1 Score: 90.1 | 2 |
Showing 10 of 13 rows
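
Several of the results above are reported as Character Error Rate (CER). As a reminder of what that metric measures, the following Python sketch computes CER in its standard form: the Levenshtein edit distance between prediction and reference, divided by the number of reference characters. Reporting the value as a percentage and skipping any text normalization (case, whitespace, punctuation) are assumptions here; the evaluation protocol behind the listed numbers may differ.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty hypothesis takes i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate as a percentage of the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("invoice total", "invoice tota1"))
```

Because the denominator is the reference length, CER can exceed 100 when the prediction contains many spurious insertions.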
