Focus, Don't Prune: Identifying Instruction-Relevant Regions for Information-Rich Image Understanding

About

Large Vision-Language Models (LVLMs) have shown strong performance across various multimodal tasks by leveraging the reasoning capabilities of Large Language Models (LLMs). However, processing visually complex and information-rich images, such as infographics or document layouts, requires these models to generate a large number of visual tokens, leading to significant computational overhead. To address this, we propose PinPoint, a novel two-stage framework that first identifies instruction-relevant image regions and then refines them to extract fine-grained visual features for improved reasoning and efficiency. Central to our approach is Instruction-Region Alignment, which localizes relevant regions using both the visual input and the textual instruction. We further introduce new annotations that provide richer ground-truth supervision for instruction-relevant regions across three challenging VQA benchmarks: InfographicVQA, MultiPageDocVQA, and SinglePageDocVQA. Experimental results show that PinPoint not only achieves higher accuracy than existing methods but also reduces computational overhead by minimizing irrelevant visual tokens.
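
The two-stage idea in the abstract (select instruction-relevant regions first, then spend visual tokens only on those) can be illustrated with a toy sketch. This is not the authors' implementation: the cosine-similarity scoring, the function names `select_regions` and `token_budget`, the 2-D embeddings, and the 256 tokens per region are all assumptions made purely for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def select_regions(region_embs, instr_emb, k):
    """Hypothetical stage 1: rank regions by similarity to the instruction, keep the top k.

    Returns the kept region indices in their original (spatial) order.
    """
    ranked = sorted(
        range(len(region_embs)),
        key=lambda i: cosine(region_embs[i], instr_emb),
        reverse=True,
    )
    return sorted(ranked[:k])

def token_budget(num_regions, kept, tokens_per_region):
    """Visual tokens used when only `kept` regions are encoded vs. all regions."""
    return len(kept) * tokens_per_region, num_regions * tokens_per_region

# Toy data: four region embeddings and one instruction embedding (all made up).
regions = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
instruction = [1.0, 0.1]

kept = select_regions(regions, instruction, k=2)
used, full = token_budget(len(regions), kept, tokens_per_region=256)
print(kept, used, full)
```

In this toy setup, only the two regions most aligned with the instruction are passed on, so the second stage would encode half as many visual tokens as processing every region.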

Mincheol Kwon, Minseung Lee, Seonga Choi, Miso Choi, Kyeong-Jin Oh, Hyunyoung Lee, Cheonyoung Park, Yongho Song, Seunghyun Park, Jinkyu Kim • 2026

Related benchmarks

Task                            | Dataset              | Result           | Rank
Hallucination Evaluation        | AMBER                | CHAIR 8          | 172
Multimodal Understanding        | MMMU                 | MMMU Score 35    | 59
Visual Question Answering       | InfoVQA              | ANLS Score 71.4  | 31
Visual Question Answering       | TextVQA              | Score 72.93      | 20
Visual Question Answering       | SPDocVQA             | ANLS 89.77       | 12
Visual Question Answering       | MPDocVQA             | ANLS 0.6723      | 12
Visual Question Answering       | GQA                  | Accuracy 76.24   | 12
Object Hallucination Evaluation | MSCOCO (val)         | CHAIRS 25.6      | 6
Multimodal Understanding        | MMMU-Pro standard 10 | Score 19.9       | 4
Hallucination Evaluation        | MSCOCO               | Rand. Score 89   | 2
