Benchmarking Vision-Language Models on Chinese Ancient Documents: From OCR to Knowledge Reasoning

About

Chinese ancient documents, invaluable carriers of millennia of Chinese history and culture, hold rich knowledge across diverse fields but remain difficult to digitize and understand: traditional methods only scan the pages as images, while current Vision-Language Models (VLMs) struggle with their visual and linguistic complexity. Existing document benchmarks focus on English printed texts or simplified Chinese, leaving a gap in the evaluation of VLMs on ancient Chinese documents. To address this, we present AncientDoc, the first benchmark for Chinese ancient documents, designed to assess VLMs from OCR to knowledge reasoning. AncientDoc includes five tasks (page-level OCR, vernacular translation, reasoning-based QA, knowledge-based QA, and linguistic variant QA) and covers 14 document types, over 100 books, and about 3,000 pages. Based on AncientDoc, we evaluate mainstream VLMs using multiple metrics, supplemented by scoring from a human-aligned large language model.

Haiyang Yu, Yuchuan Wu, Fan Shi, Lei Liao, Jinghui Lu, Xiaodong Ge, Han Wang, Minghan Zhuo, Xuecheng Wu, Xiang Fei, Hao Feng, Guozhi Tang, An-Lan Wang, Hanshen Zhu, Yangfan He, Quanhuan Liang, Liyuan Meng, Chao Feng, Can Huang, Jingqun Tang, Bin Li • 2025

Related benchmarks

Task	Dataset	Result
Deepfake Detection	DFDC	AUC 71.91	230
Deepfake Detection	DFD	AUC 0.8074	193
Deepfake Detection	DFDC (test)	--	130
Deepfake Detection	DFD (test)	--	81
Image Deepfake Detection	DFo	AUC 0.8241	62
Deepfake Detection	DFDCP (test)	AUC 80.51	56
Deepfake Detection	DFDCP	AUC 0.7594	35
Deepfake Detection	FF++ Intra-dataset c23	AUC 99.99	24
Deepfake Detection	FF++ Intra-dataset (c40)	Accuracy 96.68	15
Deepfake Detection	CDF	AUC 75.27	12
