PILOT: A Promptable Interleaved Layout-aware OCR Transformer

About

Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.
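
To make the idea of a single stream of subword and quantized coordinate tokens concrete, the following is a minimal sketch, not the authors' implementation. Only the 10 px grid step comes from the abstract. The token spelling (`<x_12>`, `<y_4>`), the use of floor rounding, the `(x1, y1, x2, y2)` box convention, the text-then-box ordering, and the helper names `quantize`, `box_to_tokens`, `tokens_to_box`, and `interleave` are all assumptions chosen for illustration.

```python
GRID_PX = 10  # grid step stated in the abstract; everything else here is assumed


def quantize(value_px, grid=GRID_PX):
    """Map a pixel coordinate to the index of its grid cell (floor rounding, assumed)."""
    return int(value_px // grid)


def box_to_tokens(box, grid=GRID_PX):
    """Turn a pixel box (x1, y1, x2, y2) into four hypothetical coordinate tokens."""
    x1, y1, x2, y2 = box
    return [
        f"<x_{quantize(x1, grid)}>",
        f"<y_{quantize(y1, grid)}>",
        f"<x_{quantize(x2, grid)}>",
        f"<y_{quantize(y2, grid)}>",
    ]


def tokens_to_box(tokens, grid=GRID_PX):
    """Recover an approximate pixel box: each cell index times the grid step."""
    # "<x_12>"[3:-1] strips the "<x_" prefix and ">" suffix, leaving "12".
    return tuple(int(tok[3:-1]) * grid for tok in tokens)


def interleave(lines):
    """Build one output stream: the text tokens of each line, then its box tokens."""
    stream = []
    for text_tokens, box in lines:
        stream.extend(text_tokens)
        stream.extend(box_to_tokens(box))
    return stream


if __name__ == "__main__":
    page = [(["Total", ":", "42"], (123, 47, 588, 92))]
    print(interleave(page))
```

With a 10 px step, a decoded box can differ from the original by up to just under 10 px per coordinate, which is the price paid for a small coordinate vocabulary.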

Laziz Hamdi, Amine Tamasna, Pascal Boisson, Thierry Paquet • 2025

Related benchmarks

Task	Dataset	Result
Handwritten text recognition	RIMES	Character Error Rate (CER) 5.05	26
Optical Character Recognition	IAM HW, EN	CER 4.31	17
Optical Character Recognition	SROIE P, EN	F1 Score 93.77	15
Text Detection	SROIE	Precision 95.67	7
Optical Character Recognition	MAURDOR French English handwritten and printed full (test)	CER 5.23	6
Query-by-String Word Spotting	IAM (test)	mAP@50% IoU 88.9	5
Region-level OCR	English documents	Edit Distance 3.8	4
Text Detection	RIMES 2009	Precision 96.86	4
Optical Character Recognition	MAURDOR C3	CER 5.86	2
Text Detection	MAURDOR French English handwritten and printed full (test)	F1 Score 90.1	2

Showing 10 of 13 rows
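
Several rows above report Character Error Rate. As a reminder of how that metric is commonly computed, here is a minimal sketch, assuming the usual definition (Levenshtein edit distance between the reference and the prediction, divided by the reference length) and assuming the table values are percentages. The paper's exact normalization and aggregation across lines or pages may differ; the function names `edit_distance` and `cer` are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting i reference characters to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character Error Rate in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("hello", "hallo"))
```

Because the denominator is the reference length, this quantity can exceed 100 when the prediction is much longer than the reference.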
