PILOT: A Promptable Interleaved Layout-aware OCR Transformer
About
Classical OCR pipelines decompose document reading into detection, segmentation, and recognition stages, which makes them sensitive to localization errors and difficult to extend to interactive querying. This work investigates whether a single compact model can jointly perform text recognition and spatial grounding on both handwritten and printed documents. We introduce PILOT, a 155M-parameter prompt-conditioned generative model that formulates document OCR as unified sequence generation. A lightweight depthwise-separable CNN encodes the page, and a Transformer decoder autoregressively emits a single stream of subword and quantized absolute-coordinate tokens on a 10 px grid, enabling full-page OCR, region-conditioned reading, and query-by-string spotting within the same architecture. A three-stage curriculum, progressing from plain transcription to joint text-and-box generation and finally to prompt-controlled extraction, stabilizes training and improves spatial grounding. Experiments on IAM, RIMES 2009, SROIE 2019, and the heterogeneous MAURDOR benchmark show that PILOT achieves competitive or superior performance in text recognition and line-level detection compared with traditional OCR systems, recent end-to-end HTR models, and compact vision-language models, while remaining substantially smaller than billion-scale multimodal models. Additional evaluations on fine-grained OCR and query-by-string spotting further confirm that a unified text-layout decoder can provide accurate and efficient promptable OCR in a compact setting. To support reproducibility, we release the synthetic SROIE generator, the 500k annotated IDL/PDFA pages, the harmonized line-level annotations for IAM, RIMES 2009, and MAURDOR, and the source code at https://github.com/hamdilaziz/PILOT.
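To make the idea of a single stream of text and quantized coordinate tokens concrete, the following is a minimal sketch and not the released implementation. The 10 px grid comes from the abstract; everything else is an assumption made for illustration: the token names (`<x_12>`, `<y_5>`), the rounding rule, the ordering of corner tokens around the text, and the whitespace split that stands in for a real subword tokenizer.

```python
from dataclasses import dataclass
from typing import List, Tuple

GRID_PX = 10  # grid step stated in the abstract


@dataclass
class Line:
    """A text line with its box; the (x0, y0, x1, y1) pixel layout is assumed."""
    text: str
    box: Tuple[float, float, float, float]


def quantize(value_px: float, grid: int = GRID_PX) -> int:
    """Map a non-negative pixel coordinate to the nearest grid index (assumed rule)."""
    return int(value_px / grid + 0.5)


def dequantize(index: int, grid: int = GRID_PX) -> int:
    """Map a grid index back to an absolute pixel coordinate."""
    return index * grid


def encode_line(line: Line) -> List[str]:
    """Interleave hypothetical coordinate tokens with text tokens for one line.

    Assumed order: top-left corner, text tokens, bottom-right corner.
    """
    x0, y0, x1, y1 = line.box
    start = [f"<x_{quantize(x0)}>", f"<y_{quantize(y0)}>"]
    end = [f"<x_{quantize(x1)}>", f"<y_{quantize(y1)}>"]
    words = line.text.split()  # placeholder for a subword tokenizer
    return start + words + end


tokens = encode_line(Line("Dear Sir", (123, 47, 305, 88)))
print(tokens)
```

Under this rounding rule, a decoded coordinate differs from the original by at most half a grid step (5 px), which gives a sense of the localization granularity a 10 px grid implies.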
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Handwritten text recognition | RIMES | Character Error Rate (CER) | 5.05 | 26 |
| Optical Character Recognition | IAM HW, EN | CER | 4.31 | 17 |
| Optical Character Recognition | SROIE P, EN | F1 Score | 93.77 | 15 |
| Text Detection | SROIE | Precision | 95.67 | 7 |
| Optical Character Recognition | MAURDOR French English handwritten and printed full (test) | CER | 5.23 | 6 |
| Query-by-String Word Spotting | IAM (test) | mAP@50% IoU | 88.9 | 5 |
| Region-level OCR | English documents | Edit Distance | 3.8 | 4 |
| Text Detection | RIMES 2009 | Precision | 96.86 | 4 |
| Optical Character Recognition | MAURDOR C3 | CER | 5.86 | 2 |
| Text Detection | MAURDOR French English handwritten and printed full (test) | F1 Score | 90.1 | 2 |