CogDoc: Towards Unified thinking in Documents

About

Current document reasoning paradigms are constrained by a fundamental trade-off between scalability (processing long-context documents) and fidelity (capturing fine-grained, multimodal details). To bridge this gap, we propose CogDoc, a unified coarse-to-fine thinking framework that mimics human cognitive processes: a low-resolution "Fast Reading" phase for scalable information localization, followed by a high-resolution "Focused Thinking" phase for deep reasoning. We conduct a rigorous investigation into post-training strategies for the unified thinking framework, demonstrating that a Direct Reinforcement Learning (RL) approach outperforms RL with Supervised Fine-Tuning (SFT) initialization. Specifically, we find that direct RL avoids the "policy conflict" observed in SFT. Empirically, our 7B model achieves state-of-the-art performance within its parameter class, notably surpassing significantly larger proprietary models (e.g., GPT-4o) on challenging, visually rich document benchmarks.
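
The coarse-to-fine control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Page` type, the `coarse_to_fine_answer` function, and the two toy callables are hypothetical names introduced here, and plain strings stand in for low- and high-resolution page images. The sketch only shows the two-phase structure, assuming that the "Fast Reading" phase returns the indices of relevant pages and that the "Focused Thinking" phase sees only those pages at full resolution.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Page:
    """Hypothetical page record; strings stand in for page images."""
    index: int
    low_res: str   # placeholder for a downsampled view of the page
    high_res: str  # placeholder for the full-resolution view of the page


def coarse_to_fine_answer(
    question: str,
    pages: Sequence[Page],
    fast_read: Callable[[str, List[str]], List[int]],
    focused_think: Callable[[str, List[str]], str],
) -> str:
    # Phase 1 ("Fast Reading"): scan every page cheaply and localize
    # the ones that look relevant to the question.
    low_res_views = [p.low_res for p in pages]
    selected = fast_read(question, low_res_views)

    # Phase 2 ("Focused Thinking"): reason only over the selected pages,
    # now at full resolution. Out-of-range indices are ignored.
    high_res_views = [pages[i].high_res for i in selected if 0 <= i < len(pages)]
    return focused_think(question, high_res_views)


# Toy stand-ins for the two model calls (assumptions, for illustration only).
def toy_fast_read(question: str, low_res_views: List[str]) -> List[int]:
    words = set(question.lower().split())
    return [i for i, view in enumerate(low_res_views)
            if words & set(view.lower().split())]


def toy_focused_think(question: str, high_res_views: List[str]) -> str:
    return " | ".join(high_res_views)


pages = [
    Page(0, "intro", "Intro: overview of the report"),
    Page(1, "revenue chart", "Revenue chart: 2023 revenue was 12M"),
    Page(2, "appendix", "Appendix: glossary"),
]
answer = coarse_to_fine_answer(
    "what was revenue", pages, toy_fast_read, toy_focused_think
)
print(answer)
```

In this toy run only the page whose low-resolution stand-in shares a word with the question reaches the second phase, which is the scalability argument in miniature: the expensive high-resolution step sees a small subset of the document.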

Qixin Xu, Haozhe Wang, Che Liu, Fangzhen Lin, Wenhu Chen • 2025

Related benchmarks

Task                                Dataset          Result          Rank
Document Visual Question Answering  SlideVQA         Accuracy 0.583  30
Document Visual Question Answering  DUDE             ANLS 46.2       30
Document Visual Question Answering  MMLongBench-Doc  Accuracy 33     29
Document Visual Question Answering  MP-DocVQA        Accuracy 75     10
