Consensus Entropy: Harnessing Multi-VLM Agreement for Self-Verifying and Self-Improving OCR
About
Optical Character Recognition (OCR) is fundamental to Vision-Language Models (VLMs) and high-quality data generation for LLM training. Yet, despite progress in average OCR accuracy, state-of-the-art VLMs still struggle with detecting sample-level errors and lack effective unsupervised quality control. We introduce Consensus Entropy (CE), a training-free, model-agnostic metric that estimates output reliability by measuring inter-model agreement entropy. The core insight is that correct predictions converge in output space, while errors diverge. Based on CE, we develop CE-OCR, a lightweight multi-model framework that verifies outputs by ensemble agreement, selects the best outputs, and further improves efficiency through adaptive routing. Experiments demonstrate that CE is robust for quality verification, improving F1 scores by 42.1% over VLM-as-Judge. CE-OCR achieves consistent OCR gains, outperforming self-consistency and single-model baselines at the same cost. Notably, CE requires no training or supervision, enabling plug-and-play integration.
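
The agreement idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the CE formula, so the code below is not the paper's metric: it assumes disagreement is measured as the mean pairwise string dissimilarity from Python's `difflib.SequenceMatcher`. The function names and the `threshold` value are hypothetical, chosen only for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations


def pairwise_disagreement(a: str, b: str) -> float:
    """Return 1 - similarity ratio, in [0, 1]; 0 means identical strings."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def disagreement_score(outputs: list[str]) -> float:
    """Mean pairwise disagreement across model outputs.

    Assumption: this stands in for an agreement-entropy measure.
    Lower values mean the models converge on the same text.
    """
    if len(outputs) < 2:
        raise ValueError("need at least two model outputs")
    pairs = list(combinations(outputs, 2))
    return sum(pairwise_disagreement(a, b) for a, b in pairs) / len(pairs)


def select_consensus_output(outputs: list[str]) -> str:
    """Pick the output that is closest, on average, to all the others."""
    def mean_distance(i: int) -> float:
        others = [o for j, o in enumerate(outputs) if j != i]
        return sum(pairwise_disagreement(outputs[i], o) for o in others) / len(others)

    best_index = min(range(len(outputs)), key=mean_distance)
    return outputs[best_index]


def is_reliable(outputs: list[str], threshold: float = 0.1) -> bool:
    """Accept a sample when disagreement is at or below a hypothetical threshold."""
    return disagreement_score(outputs) <= threshold


# Three hypothetical model transcriptions of the same image region.
ocr_outputs = ["Invoice 1234", "Invoice 1234", "lnvoice l234"]
print(disagreement_score(ocr_outputs))
print(select_consensus_output(ocr_outputs))
```

In this sketch, a sample whose outputs disagree strongly would be flagged as unreliable, which is one plausible place to route it to a stronger model or a human reviewer. The routing policy itself is an assumption and is not described in the abstract.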
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Optical Character Recognition | OCRBench | Score 922 | 232 |
| Visual Question Answering | ScienceVQA | -- | 36 |
| OCR Verification | Curated dataset of 1,000 PDF pages 1.0 (test) | F1 Score 70.88 | 30 |
| Multimodal Optical Character Recognition | OCRBench v2 | En Accuracy 71.6 | 5 |
| Visual Question Answering | Scene-VQA easy | Accuracy 98 | 2 |
| Visual Question Answering | Doc-VQA easy | Accuracy 90.5 | 2 |
| Visual Question Answering | Formula easy | Accuracy 88 | 2 |
| Visual Question Answering | Math-VQA | Accuracy 45.6 | 2 |
| Visual Question Answering | Knowledge-Reasoning | Accuracy 66.3 | 2 |
| Visual Question Answering | Visual-Understanding | Accuracy 82.4 | 2 |