The Patrologia Graeca Corpus: OCR, Annotation, and Open Release of Noisy Nineteenth-Century Polytonic Greek Editions

About

We present the Patrologia Graeca Corpus, the first large-scale open OCR and linguistic resource for nineteenth-century editions of Ancient Greek. The collection covers the remaining undigitized volumes of the Patrologia Graeca (PG), printed in complex bilingual (Greek-Latin) layouts and characterized by highly degraded polytonic Greek typography. Using a dedicated pipeline that combines YOLO-based layout detection with CRNN-based text recognition, we achieve a character error rate (CER) of 1.05% and a word error rate (WER) of 4.69%, substantially outperforming existing OCR systems for polytonic Greek. The resulting corpus contains around six million lemmatized and part-of-speech tagged tokens, aligned with full OCR and layout annotations. Beyond its philological value, this corpus establishes a new benchmark for OCR on noisy polytonic Greek and provides training material for future models, including LLMs.
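
For readers unfamiliar with the two metrics quoted above, the following is a minimal sketch of how CER and WER are conventionally computed: the Levenshtein edit distance between reference and OCR output, divided by the reference length in characters or words. The function names, the NFC normalization step, and the whitespace tokenization are assumptions made for illustration; the abstract does not state how the authors normalize or tokenize text before scoring.

```python
import unicodedata


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits / reference characters."""
    # Assumption: compare in NFC so that precomposed and decomposed
    # polytonic characters are treated as the same character.
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    return edit_distance(ref, hyp) / len(ref)


def wer(ref, hyp):
    """Word error rate: word edits / reference words."""
    # Assumption: words are whitespace-separated tokens.
    ref_words = unicodedata.normalize("NFC", ref).split()
    hyp_words = unicodedata.normalize("NFC", hyp).split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)


# Hypothetical example: the OCR output drops the diacritics on one letter.
reference = "ἐν ἀρχῇ ἦν ὁ λόγος"
ocr_output = "ἐν ἀρχη ἦν ὁ λόγος"
print(f"CER = {cer(reference, ocr_output):.4f}")
print(f"WER = {wer(reference, ocr_output):.4f}")
```

A single wrong diacritic counts as one character error but invalidates the whole word, which is why WER is typically several times higher than CER on polytonic Greek, as in the figures reported above.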

Chahan Vidal-Gorène, Bastien Kindt • 2026

Related benchmarks

Task              Dataset            Metric     Result  Rank
Layout Detection  30-page PG (test)  Precision  96.9    8
Text Recognition  30-page PG (test)  CER (%)    1.05    4
Line Detection    30-page PG (test)  Precision  98.3    1
Reading order     30-page PG (test)  Precision  98      1