
Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA

About

Multi-page Document Visual Question Answering (DocVQA) requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, aggregating evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization (GRPO), Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over a RAG baseline. Further analysis shows that the gains stem from effective evidence aggregation with selective attention rather than from increasing the number of input pages.
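The coarse-to-fine loop described above can be sketched in a few lines. This is a toy illustration only: every helper below (`thumbnail_overview`, `retrieve_pages`, keyword-overlap scoring) is a hypothetical stand-in for the model components the abstract names, not the paper's actual API or training setup.

```python
# Illustrative sketch of a coarse-to-fine evidence-aggregation loop.
# All helpers here are toy stand-ins (assumptions), not the paper's code.

def thumbnail_overview(document):
    # Coarse pass: a cheap low-resolution glance at every page,
    # approximated here by truncating each page's text.
    return [page[:20] for page in document]

def retrieve_pages(question, document, memory):
    # Semantic-retrieval stand-in: rank unseen pages by keyword overlap
    # with the question; a real system would use learned embeddings.
    seen = {i for kind, i in memory if kind == "page"}
    scores = [
        (sum(w in page.lower() for w in question.lower().split()), i)
        for i, page in enumerate(document) if i not in seen
    ]
    return [i for score, i in sorted(scores, reverse=True) if score > 0]

def doc_v_star_loop(question, document, max_steps=3):
    """Overview first, then targeted page fetches into a working memory."""
    memory = [("overview", thumbnail_overview(document))]
    for _ in range(max_steps):
        candidates = retrieve_pages(question, document, memory)
        if not candidates:
            break  # no remaining page looks relevant
        memory.append(("page", candidates[0]))  # fetch top-ranked page
    return memory
```

The returned working memory holds only the overview plus the few pages the agent chose to fetch, which mirrors the abstract's point that performance comes from selective aggregation rather than feeding in more pages.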

Yuanlei Zheng, Pei Fu, Hang Li, Ziyang Wang, Yuyi Zhang, Wenyu Ruan, Xiaojin Zhang, Zhongyu Wei, Zhenbo Luo, Jian Luan, Wei Chen, Xiang Bai • 2026

Related benchmarks

Task                                   | Dataset         | Metric   | Result | Rank
Long-context document understanding    | MMLongBench-Doc | Accuracy | 42.1   | 58
Document Visual Question Answering     | SlideVQA        | F1 Score | 0.772  | 32
Multi-page Document Question Answering | MP-DocVQA       | ANLS     | 86.2   | 27
Multi-page Document Understanding      | DUDE            | ANLS     | 64.5   | 21
Long-context document understanding    | LongDocURL      | Accuracy | 56.3   | 14
