GLM-OCR Technical Report
About
GLM-OCR is a compact, efficient 0.9B-parameter multimodal model designed for real-world document understanding. It combines a 0.4B-parameter CogViT visual encoder with a 0.5B-parameter GLM language decoder, balancing computational efficiency against recognition performance. Because standard autoregressive decoding is inefficient for deterministic OCR tasks, GLM-OCR introduces a Multi-Token Prediction (MTP) mechanism that predicts multiple tokens per step, significantly improving decoding throughput while keeping memory overhead low through shared parameters. At the system level, it adopts a two-stage pipeline: PP-DocLayout-V3 first performs layout analysis, and the detected regions are then recognized in parallel. Extensive evaluations on public benchmarks and industrial scenarios show that GLM-OCR achieves competitive or state-of-the-art performance in document parsing, text and formula transcription, table structure recovery, and key information extraction. Its compact architecture and structured generation make it suitable for both resource-constrained edge deployment and large-scale production systems.
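The two-stage pipeline described above (layout analysis first, then parallel region-level recognition) can be illustrated with a minimal Python sketch. It is not the GLM-OCR implementation: `Region`, `detect_layout`, `recognize_region`, and `parse_page` are hypothetical names, and both stages are replaced by stubs so that only the control flow is shown.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    """Hypothetical layout region: reading-order index, category, bounding box."""
    order: int
    kind: str      # e.g. "text", "formula", "table"
    bbox: tuple    # (x0, y0, x1, y1) in page pixels


def detect_layout(page):
    """Stub for stage 1 (the report uses PP-DocLayout-V3 here).

    A real detector would predict regions from the page image; this stub
    returns fixed regions so the sketch runs without any model.
    """
    return [
        Region(0, "text", (0, 0, 100, 20)),
        Region(1, "formula", (0, 25, 100, 40)),
        Region(2, "table", (0, 45, 100, 90)),
    ]


def recognize_region(page, region):
    """Stub for stage 2: a real recognizer would crop the region and decode it."""
    return f"<{region.kind} @ {region.bbox}>"


def parse_page(page, max_workers=4):
    """Run layout analysis once, then recognize all regions concurrently."""
    regions = detect_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so outputs align with regions.
        outputs = list(pool.map(lambda r: recognize_region(page, r), regions))
    # Reassemble the page in reading order.
    ordered = sorted(zip(regions, outputs), key=lambda pair: pair[0].order)
    return "\n".join(text for _, text in ordered)


if __name__ == "__main__":
    print(parse_page(page=None))
```

Because each region is recognized independently, the second stage parallelizes naturally. A thread pool is only one possible choice, and a production system might instead batch region crops into a single model call.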
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Document Parsing | OmniDocBench v1.5 | Overall Score | 94.62 | 195 |
| Document Parsing | OmniDocBench 1.5 (test) | Text Edit Error | 0.04 | 111 |
| Document Parsing | OmniDocBench Real5 warping | Overall Score | 90.68 | 32 |
| Document Parsing | Real5-OmniDocBench (screen-photography) | Overall Score | 91.75 | 32 |
| Document Parsing | OmniDocBench Real5 skewing variation | Overall Score | 85.39 | 32 |
| Document Parsing | OmniDocBench Real5 | Score | 91.12 | 26 |
| Document Parsing | OmniDocBench Full v1.6 | Overall Accuracy | 95.15 | 21 |
| Table Recognition | PubTabNet | Overall Score | 85.2 | 14 |
| Document Parsing | OmniDocBench Scanning Real5 | Overall Score | 92.67 | 13 |
| Formula Recognition | UniMERNet SCE | CDM | 97.77 | 9 |