Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models
About
Recent progress in vision-language models (VLMs) has led to impressive results on document understanding tasks, but their high computational demands remain a challenge. To mitigate this burden, we propose a lightweight token pruning framework that filters out non-informative background regions from document images before VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to improve spatial coherence. Experiments on real-world document datasets show that our approach substantially lowers computational cost while maintaining comparable accuracy.
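The pipeline above can be sketched in a few lines: threshold the patch classifier's scores into a binary mask, dilate the mask with a max-pooling pass to recover fragmented text regions, and keep the surviving patches by their original grid indices so positional information is preserved. The function names, the 0.5 threshold, and the 3×3 pooling window are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def refine_mask(mask: np.ndarray, k: int = 3) -> np.ndarray:
    """Max-pool (dilate) a binary patch mask with a k x k window,
    reconnecting fragmented text regions around kept patches."""
    h, w = mask.shape
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant")
    out = np.zeros_like(mask)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def prune_patches(patch_scores: np.ndarray,
                  threshold: float = 0.5, k: int = 3) -> np.ndarray:
    """Return flat indices of patches kept after pruning.

    patch_scores: (H, W) per-patch text probability from the binary
    classifier. Returned indices refer to the original patch grid,
    so the VLM's positional embeddings remain valid (index-preserving).
    """
    mask = (patch_scores > threshold).astype(np.uint8)
    mask = refine_mask(mask, k)           # max-pooling refinement
    return np.flatnonzero(mask.ravel())   # original grid positions
```

For example, on a 5×5 patch grid where only the center patch scores above threshold, the 3×3 refinement keeps the surrounding 3×3 neighborhood (9 patches), and the returned indices point back into the unpruned grid.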
Jaemin Son, Sujin Choi, Inyong Yun • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Information Extraction | SROIE (test) | F1 Score | 87.9 | 62 |
| Document Parsing | SCAN (test) | ANLS | 61.8 | 4 |
| Document Parsing | Photo (test) | ANLS | 71 | 4 |
| Key Information Extraction | CORD (test) | F1 Score | 83 | 4 |