Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval

About

Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive multi-vector index storage overhead. Existing training-free pruning methods either rely on heuristic layer choices or degrade sharply under aggressive compression, leading prior work to argue that effective high-compression pruning requires query-dependent training. We challenge this view with Structural Anchor Pruning (SAP), a self-calibrating, training-free, and query-agnostic index-time pruning framework with three components: (i) Score Retention (SR), a white-box per-layer compression diagnostic; (ii) SR-guided window selection, a procedure that automatically locates the structural pruning region for any backbone with no per-model hyperparameters; and (iii) a visual in-degree centrality scorer that identifies anchor patches within the selected window. On the ViDoRe v1/v2 benchmarks across three architectures spanning 18, 28, and 36 backbone layers, SAP retains over 90% of NDCG@5 while pruning more than 90% of visual tokens, without any per-model parameter tuning. Our layer-resolved SR analysis reveals an Alignment-Aggregation Divergence: the document's visual structure is preserved as a stable "Structural Plateau" within the backbone, but the final layers reshape this representation into a sparse, query-aligned form that is no longer suitable for pruning. This is the mechanistic reason SAP succeeds where final-layer methods fail.
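The abstract does not spell out how the visual in-degree centrality scorer is computed, so the following is only a minimal sketch of the general idea under stated assumptions: that a head-averaged, patch-to-patch attention matrix is available from some layer inside the selected window, that a patch's in-degree centrality is the total attention it receives (a column sum), and that the top-scoring patches are kept as anchors. The function names and the `keep_ratio` parameter are hypothetical and are not taken from the paper.

```python
from typing import List


def in_degree_centrality(attn: List[List[float]]) -> List[float]:
    """Total attention each patch receives (column sums of attn).

    Assumption: attn[i][j] is the attention patch i pays to patch j,
    already averaged over heads and restricted to visual tokens.
    """
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) for j in range(n)]


def select_anchor_patches(attn: List[List[float]], keep_ratio: float = 0.1) -> List[int]:
    """Return indices of the highest-centrality patches, in document order.

    keep_ratio is a hypothetical knob: 0.1 keeps roughly 10% of patches,
    i.e. prunes roughly 90% of visual tokens.
    """
    scores = in_degree_centrality(attn)
    k = max(1, round(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy 4-patch attention matrix in which patch 1 receives most of the attention.
attn = [
    [0.1, 0.6, 0.1, 0.2],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.6, 0.2, 0.1],
]
anchors = select_anchor_patches(attn, keep_ratio=0.25)
print(anchors)
```

Because the selection uses only document-side attention, it can run once at index time with no query involved, which is the property the abstract calls query-agnostic. Only the embeddings of the retained anchor patches would then be stored in the multi-vector index.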

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao • 2026

Related benchmarks

Task	Dataset	Result
Visual document retrieval	ViDoRe v2 (avg. across 4 datasets)	Full NDCG 0.58	45
Image-Text Retrieval	ImageCoDe	R@1 4.8	21
Image-Text Retrieval	DocVQA	R@1 45.5	21
Image-Text Retrieval	Flickr30K	Recall@1 69.3	21
Image-Text Retrieval	MSCOCO	Recall@1 40.7	21
