PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

About

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
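The abstract names the two token types but gives no equations or layer details, so the following is only a rough NumPy sketch of one possible reading. Everything in it is an assumption made for illustration: the function names (`pool_tokens`, `attend`, `compress`), the use of plain average pooling for the anchors, projection-free single-head attention, the residual form of the conditioning step, and the choice of a nested prefix of a shared query bank for elasticity. It is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(q, kv):
    """Single-head scaled dot-product attention with no learned projections
    (a simplification; a real model would use learned Q/K/V weights)."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ kv


def pool_tokens(grid, out_side):
    """Average-pool an (H, W, D) feature grid into out_side**2 grid-aligned tokens."""
    h, w, d = grid.shape
    assert h % out_side == 0 and w % out_side == 0, "grid must divide evenly"
    fh, fw = h // out_side, w // out_side
    pooled = grid.reshape(out_side, fh, out_side, fw, d).mean(axis=(1, 3))
    return pooled.reshape(out_side * out_side, d)


def compress(grid, queries, pool_side, num_queries):
    """Return pool anchors followed by anchor-conditioned query tokens.

    ASSUMPTION: conditioning is modeled as a residual attention step from the
    queries to the pool anchors, followed by attention over the full grid.
    """
    h, w, d = grid.shape
    anchors = pool_tokens(grid, pool_side)         # low-frequency layout anchors
    q = queries[:num_queries]                      # nested prefix of a shared query bank
    q = q + attend(q, anchors)                     # condition queries on the anchors
    resampled = attend(q, grid.reshape(h * w, d))  # resample from dense features
    return np.concatenate([anchors, resampled], axis=0)


rng = np.random.default_rng(0)
grid = rng.standard_normal((24, 24, 64))    # stand-in for vision-encoder features
queries = rng.standard_normal((32, 64))     # stand-in for learned queries

# One set of "weights", several visual-token budgets.
for pool_side, num_queries in [(2, 4), (4, 16), (6, 32)]:
    tokens = compress(grid, queries, pool_side, num_queries)
    print(pool_side, num_queries, tokens.shape)
```

The loop at the end illustrates the "train once, deploy anywhere" idea in the loosest sense: the same query bank and the same feature grid yield token sequences of different lengths depending on the chosen budget. How PARCEL actually splits a budget between pool and query tokens is not stated in the abstract.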

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Question Answering | ActivityNet-QA | -- | 418 |
| Document Visual Question Answering | DocVQA (val) | Accuracy 63.7 | 166 |
| Chart Question Answering | ChartQA augmented | Accuracy 85.8 | 26 |
| Chart Visual Question Answering | ChartQA human | Score 37.2 | 10 |
| Referring Image Segmentation | RefCOCO avg | Score 63.6 | 10 |
| Referring Image Segmentation | RefCOCO+ avg | Score 58.4 | 10 |
| Referring Image Segmentation | RefCOCO-g Avg. | Score 57.8 | 10 |
| Video Captioning | MSRVTT Cap | Score 68.9 | 10 |
| Video Understanding | Video benchmarks Aggregated | Macro Average Score 53.1 | 9 |
| Vision-Language Understanding (Image) | RefCOCO Resolution-Sensitive and General | Absolute Macro Score 70.4 | 9 |
Showing 10 of 13 rows
