LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

About

We present LightOnOCR-2-1B, a 1B-parameter end-to-end multilingual vision-language model that converts document images (e.g., PDFs) into clean, naturally ordered text without brittle OCR pipelines. Trained on a large-scale, high-quality distillation mix with strong coverage of scans, French documents, and scientific PDFs, LightOnOCR-2 achieves state-of-the-art results on OlmOCR-Bench while being 9× smaller and substantially faster than prior best-performing models. We further extend the output format to predict normalized bounding boxes for embedded images, introducing localization during pretraining via a resume strategy and refining it with RLVR using IoU-based rewards. Finally, we improve robustness with checkpoint averaging and task-arithmetic merging. We release model checkpoints under Apache 2.0, and publicly release the dataset and LightOnOCR-bbox-bench evaluation under their respective licenses.
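The IoU-based reward mentioned above rests on the standard intersection-over-union measure between a predicted and a reference box. The following is a minimal sketch of that measure only, assuming boxes are given as (x_min, y_min, x_max, y_max) with coordinates normalized to [0, 1]; the box layout, the `Box` alias, and the `iou` function name are illustrative assumptions, and the actual reward shaping used for RLVR is not specified here.

```python
from typing import Tuple

# Assumed layout (not confirmed by the abstract):
# (x_min, y_min, x_max, y_max), each coordinate normalized to [0, 1].
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix_min = max(a[0], b[0])
    iy_min = max(a[1], b[1])
    ix_max = min(a[2], b[2])
    iy_max = min(a[3], b[3])

    # Clamp at zero so disjoint boxes give an empty intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.0, 0.0, 0.5, 1.0)
    reference = (0.25, 0.0, 0.75, 1.0)
    print(f"IoU: {iou(predicted, reference):.3f}")
```

A detection metric such as F1 at a fixed IoU threshold would typically count a predicted box as correct when this value meets the threshold, though the exact matching procedure of the benchmark is not described in this listing.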

Said Taghadouini, Adrien Cavaillès, Baptiste Aubertin • 2026

Related benchmarks

Task	Dataset	Result
Document Parsing	olmOCR-bench	ArXiv Processing Accuracy 89.6	59
Table Extraction	100 pages (451 tables) synthetic (test)	LLM Score (Overall) 9.08	21
Document Parsing	OmniDocBench EN v1.0	Overall Edit Distance 0.146	15
Document Parsing	OmniDocBench ZH v1.0	Overall Edit 0.255	15
Bounding box detection	LightOnOCR-bbox-bench OlmOCR (290)	F1@0.5 78	3
Bounding box detection	LightOnOCR-bbox-bench arXiv	F1@0.5 83	3
