Index-Preserving Lightweight Token Pruning for Efficient Document Understanding in Vision-Language Models
About
Recent progress in vision-language models (VLMs) has led to impressive results on document understanding tasks, but their high computational demands remain a challenge. To mitigate this burden, we propose a lightweight token pruning framework that filters out non-informative background regions from document images before VLM processing. A binary patch-level classifier removes non-text areas, and a max-pooling refinement step recovers fragmented text regions to improve spatial coherence. Experiments on real-world document datasets show that our approach substantially lowers computational cost while maintaining comparable accuracy.
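The pipeline above can be sketched in a few lines: threshold the patch classifier's scores into a binary mask, dilate the mask with a max-pooling pass to recover fragmented text regions, and keep the surviving patches by their original grid indices so positional information is preserved. The function names, the 0.5 threshold, and the 3×3 pooling window are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def refine_mask(mask: np.ndarray, k: int = 3) -> np.ndarray:
    """Max-pool (dilate) a binary patch mask with a k x k window,
    reconnecting fragmented text regions around kept patches."""
    h, w = mask.shape
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant")
    out = np.zeros_like(mask)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def prune_patches(patch_scores: np.ndarray,
                  threshold: float = 0.5, k: int = 3) -> np.ndarray:
    """Return flat indices of patches kept after pruning.

    patch_scores: (H, W) per-patch text probability from the binary
    classifier. Returned indices refer to the original patch grid,
    so the VLM's positional embeddings remain valid (index-preserving).
    """
    mask = (patch_scores > threshold).astype(np.uint8)
    mask = refine_mask(mask, k)           # max-pooling refinement
    return np.flatnonzero(mask.ravel())   # original grid positions
```

For example, on a 5×5 patch grid where only the center patch scores above threshold, the 3×3 refinement keeps the surrounding 3×3 neighborhood (9 patches), and the returned indices point back into the unpruned grid.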
Jaemin Son, Sujin Choi, Inyong Yun • 2025
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Information Extraction | SROIE (test) | F1 Score | 87.9 | 62 |
| Document Parsing | SCAN (test) | ANLS | 61.8 | 4 |
| Document Parsing | Photo (test) | ANLS | 71 | 4 |
| Key Information Extraction | CORD (test) | F1 Score | 83 | 4 |