Activation Quantization of Vision Encoders Needs Prefixing Registers

About

Large pretrained vision encoders are central to multimodal intelligence, powering applications from on-device vision processing to vision-language models. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but it remains challenging even at 8-bit precision due to so-called outliers. In this work, we propose RegCache, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the vision encoder, which prevent other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experimental results show that our method consistently improves quantized model performance across various vision encoders, particularly in extremely low-bit regimes (e.g., 4-bit).
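As a rough illustration of the motivation (not the paper's implementation), the sketch below shows why a single activation outlier degrades symmetric per-tensor int8 quantization: the outlier stretches the quantization range, so all other values are rounded coarsely. If an outlier-prone prefix token absorbed that value instead, the remaining tokens would quantize over a much tighter range. All values here are hypothetical.

```python
# Minimal sketch, assuming symmetric per-tensor int8 quantization.
# Not the RegCache algorithm itself; it only illustrates the outlier problem.

def quantize_int8(xs):
    """Quantize to int8 with scale = max|x| / 127, then dequantize."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # guard against all-zero input
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return [v * scale for v in q]  # dequantized values

def mean_abs_error(xs, ys):
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# "Normal" activations plus one outlier token (hypothetical values).
normal = [0.8, -0.5, 0.3, -0.9, 0.6, -0.2]
with_outlier = normal + [60.0]  # the outlier stretches the quantization range

# Error on the normal values when the outlier shares their quantization range:
err_outlier = mean_abs_error(normal, quantize_int8(with_outlier)[:len(normal)])

# Error when the normal values are quantized on their own (as if a prefix
# register had absorbed the outlier):
err_clean = mean_abs_error(normal, quantize_int8(normal))

print(err_outlier > 10 * err_clean)  # the outlier inflates error by >10x
```

The gap widens further at 4-bit precision, where only 16 quantization levels must cover the stretched range, which is consistent with the abstract's observation that low-bit regimes benefit most.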

Seunghyeon Kim, Taesun Yeom, Jinho Kim, Wonpyo Park, Kyuyeun Kim, Jaeho Lee • 2025

Related benchmarks

Task                       Dataset        Metric     Result   Rank
Visual Question Answering  VQA v2         Accuracy   58.83    1362
Image Classification       Food-101       Accuracy   88.17    542
Image Classification       DTD            Accuracy   63.62    542
Image Classification       StanfordCars   Accuracy   89.73    312
Image Classification       Caltech101     Accuracy   92.65    228
Video Understanding        MLVU           Score      45.4     221
Classification             ImageNet1K     Accuracy   82.92    202
Image Classification       Flowers-102    Top-1 Acc  80.32    198
Text-to-Image Retrieval    MS-COCO        R@1        52.5     151
Image-to-Text Retrieval    MS-COCO        R@1        71.54    132
Showing 10 of 20 rows
