PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding
About
Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.
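The interplay between pool anchors and conditioned queries can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify how Pool-Conditioned Query Resampling is computed, so the sketch assumes that conditioning is a residual cross-attention from the queries to the pool tokens, that elasticity means taking a nested prefix of a shared query bank, and that attention is single-head with no learned projections. All names here (`pool_tokens`, `cross_attend`, `compress`, `pool_side`, `num_queries`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(q, kv):
    # Single-head scaled dot-product attention, no learned projections (assumption).
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

def pool_tokens(feats, grid, out):
    # Average-pool a (grid*grid, d) row-major feature map to (out*out, d) anchors.
    # Requires grid to be divisible by out.
    d = feats.shape[-1]
    s = grid // out
    x = feats.reshape(out, s, out, s, d)
    return x.mean(axis=(1, 3)).reshape(out * out, d)

def compress(feats, queries, grid, pool_side, num_queries):
    # 1. Spatial pool tokens act as grid-aligned, low-frequency layout anchors.
    anchors = pool_tokens(feats, grid, pool_side)
    # 2. Elastic queries: a nested prefix of one shared query bank (assumption).
    q = queries[:num_queries]
    # 3. Condition the queries on the anchors (assumed: residual cross-attention).
    q = q + cross_attend(q, anchors)
    # 4. Resample from the full-resolution features with the conditioned queries.
    q = q + cross_attend(q, feats)
    # 5. The compressed visual sequence is anchors followed by query tokens.
    return np.concatenate([anchors, q], axis=0)

rng = np.random.default_rng(0)
feats = rng.normal(size=(16 * 16, 32))   # 16x16 grid of 32-dim visual features
queries = rng.normal(size=(64, 32))      # shared query bank
tokens = compress(feats, queries, grid=16, pool_side=4, num_queries=8)
print(tokens.shape)  # (24, 32): 16 pool anchors + 8 query tokens
```

Changing `pool_side` or `num_queries` at call time changes the visual-token budget without touching the query bank, which is the sense in which a single set of parameters could serve several budgets in this sketch.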
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | ActivityNet-QA | -- | 418 |
| Document Visual Question Answering | DocVQA (val) | Accuracy 63.7 | 166 |
| Chart Question Answering | ChartQA augmented | Accuracy 85.8 | 26 |
| Chart Visual Question Answering | ChartQA human | Score 37.2 | 10 |
| Referring Image Segmentation | RefCOCO avg | Score 63.6 | 10 |
| Referring Image Segmentation | RefCOCO+ avg | Score 58.4 | 10 |
| Referring Image Segmentation | RefCOCO-g Avg. | Score 57.8 | 10 |
| Video Captioning | MSRVTT Cap | Score 68.9 | 10 |
| Video Understanding | Video benchmarks Aggregated | Macro Average Score 53.1 | 9 |
| Vision-Language Understanding (Image) | RefCOCO Resolution-Sensitive and General | Absolute Macro Score 70.4 | 9 |