Towards Joint Quantization and Token Pruning of Vision-Language Models
About
Deploying Vision-Language Models (VLMs) under aggressive low-bit inference remains challenging because inference cost is dominated by the long visual-token prefix during prefill and the growing KV cache during autoregressive decoding. Token pruning and low-bit quantization are complementary for reducing these costs, yet naive stage-wise combinations are often brittle due to a mismatch between quantization calibration and pruning execution. We present a collaborative quantization-and-pruning framework that unifies low-bit inference and deterministic visual-token pruning in a single deployable pipeline. The framework introduces the Quantization Unified Offline Token Allocator (QUOTA), which converts low-bit calibration signals into a layer-wise token allocation schedule and materializes it as a pruning recipe. Token importance is evaluated under deployed W4A4 operators with a quantized KV cache by combining activation magnitude, attention cues, and an explicit low-bit risk signal, enabling consistent budgeted top-$k$ selection. Experiments on standard VLM benchmarks show improved robustness over stage-wise baselines under the same low-bit regime, achieving 95.65% average retention while retaining only 30% of visual tokens, compared with about 94.3% retention for representative stage-wise combinations. The code will be released.
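The budgeted top-$k$ selection described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `token_scores` and `select_top_k`, the min-max normalization, the linear weighting, and the choice to treat the low-bit risk signal as a penalty are all assumptions made for illustration, since the abstract does not state how the three signals are combined.

```python
import numpy as np


def token_scores(act_mag, attn, risk, weights=(1.0, 1.0, 1.0)):
    """Combine three per-token signals into one importance score.

    Hypothetical scoring rule: each signal is min-max normalized and
    combined linearly. The abstract does not specify the actual formula.
    """
    def normalize(x):
        x = np.asarray(x, dtype=np.float64)
        span = x.max() - x.min()
        return np.zeros_like(x) if span == 0 else (x - x.min()) / span

    w_act, w_attn, w_risk = weights
    # Assumption: a higher low-bit risk lowers a token's score.
    # Pass a negative w_risk to flip the sign.
    return (w_act * normalize(act_mag)
            + w_attn * normalize(attn)
            - w_risk * normalize(risk))


def select_top_k(scores, keep_ratio):
    """Return the sorted indices of the tokens kept under a fixed budget."""
    scores = np.asarray(scores, dtype=np.float64)
    k = max(1, int(round(keep_ratio * scores.size)))
    # A stable sort on negated scores breaks ties by lower index,
    # so the same scores always yield the same kept set.
    order = np.argsort(-scores, kind="stable")
    return np.sort(order[:k])
```

With `keep_ratio=0.3`, `select_top_k` keeps three of every ten visual tokens, which matches the 30% budget quoted in the abstract. Returning the indices in ascending order preserves the original token order, which is usually what a positional encoding expects.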
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | -- | -- | 2019 |
| Vision-Language Understanding | MME | Average Score | 94.52 | 18 |
| Aggregated Performance Evaluation | VLM Evaluation Average | Average Relative Performance | 93.38 | 11 |
| Vision-Language Understanding | SEED-Bench Image | Average Accuracy | 75.19 | 11 |
| Visual Question Answering | GQA | Relative Performance | 78.76 | 11 |
| Multimodal Vision-Language Evaluation | MMB, MME, GQA, POPE, SQA^I | MMB Score | 92.14 | 10 |
| Science Question Answering | ScienceQA image | Relative Performance | 96.85 | 8 |