When Large Vision-Language Model Meets Large Remote Sensing Imagery: Coarse-to-Fine Text-Guided Token Pruning

About

Efficient vision-language understanding of large Remote Sensing Images (RSIs) is meaningful but challenging. Current Large Vision-Language Models (LVLMs) typically employ limited pre-defined grids to process images, leading to information loss when handling gigapixel RSIs. Conversely, using unlimited grids significantly increases computational costs. To preserve image details while reducing computational complexity, we propose a text-guided token pruning method with Dynamic Image Pyramid (DIP) integration. Our method introduces: (i) a Region Focus Module (RFM) that leverages text-aware region localization capability to identify critical vision tokens, and (ii) a coarse-to-fine image tile selection and vision token pruning strategy based on DIP, which is guided by RFM outputs and avoids directly processing the entire large image. Additionally, existing benchmarks for evaluating LVLMs' perception ability on large RSIs suffer from limited question diversity and constrained image sizes. We construct a new benchmark named LRS-VQA, which contains 7,333 QA pairs across 8 categories, with image lengths of up to 27,328 pixels. Our method outperforms existing high-resolution strategies on four datasets using the same data. Moreover, compared to existing token reduction methods, our approach demonstrates higher efficiency under high-resolution settings. The dataset and code are available at https://github.com/VisionXLab/LRS-VQA.
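
The coarse-to-fine idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the 2x2 tile split between pyramid levels, the `keep_per_level` budget, and the `toy_score` function (a hypothetical stand-in for the text-aware relevance that the RFM would provide) are all assumptions made for illustration. The sketch covers only tile selection and does not attempt vision token pruning inside a tile.

```python
from typing import Callable, List, Tuple

# (level, row, col); level 0 is the coarsest view of the image.
Tile = Tuple[int, int, int]


def children(tile: Tile) -> List[Tile]:
    """Split a tile into a 2x2 block at the next finer level (an assumption)."""
    level, r, c = tile
    return [(level + 1, 2 * r + dr, 2 * c + dc) for dr in (0, 1) for dc in (0, 1)]


def coarse_to_fine_select(
    score: Callable[[Tile], float],
    num_levels: int,
    keep_per_level: int,
) -> List[Tile]:
    """Keep the top-scoring tiles at each level and refine only those."""
    candidates: List[Tile] = [(0, 0, 0)]
    for level in range(num_levels):
        kept = sorted(candidates, key=score, reverse=True)[:keep_per_level]
        if level == num_levels - 1:
            return kept
        # Only the children of kept tiles are ever scored at the next level.
        candidates = [child for t in kept for child in children(t)]
    return []


def toy_score(tile: Tile, target: Tuple[float, float] = (0.8, 0.1)) -> float:
    """Hypothetical relevance: negative squared distance from the tile
    centre to a target point given in normalized (y, x) coordinates."""
    level, r, c = tile
    n = 2 ** level
    cy, cx = (r + 0.5) / n, (c + 0.5) / n
    return -((cy - target[0]) ** 2 + (cx - target[1]) ** 2)


if __name__ == "__main__":
    print(coarse_to_fine_select(toy_score, num_levels=3, keep_per_level=1))
```

With three levels and one tile kept per level, this toy run scores 9 tiles in total (1 + 4 + 4) rather than all 16 tiles of the finest grid, which is the kind of saving the coarse-to-fine strategy is aiming for.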

Junwei Luo, Yingying Zhang, Xue Yang, Kang Wu, Qi Zhu, Lei Liang, Jingdong Chen, Yansheng Li • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| Visual Question Answering | RSVQA-HR | Average Score 68.7 | 29 |
| Remote Sensing Perception and Reasoning | XLRS-Bench | Average Score (Avg.) 42.2 | 19 |
| Remote Sensing Scene Classification | EuroSAT | -- | 15 |
| Visual Question Answering | RSVQA LR | Aggregated Score 29.9 | 14 |
| Remote Sensing Scene Classification | Million-AID | F1 Score 35.4 | 10 |
| Visual Question Answering | XLRS-Bench vqa | F1 Score 8.8 | 10 |
| Image Captioning | XLRS-Bench caption | GEval Score 14.7 | 10 |
| Remote Sensing Scene Classification | SkyScript bench | F1 Score 43.3 | 10 |
| Image Captioning | VRSBench caption | GEval Score 0.193 | 10 |
| Visual Question Answering | VRSBench vqa | F1 Score 45.7 | 10 |
Showing 10 of 13 rows
