Activation Quantization of Vision Encoders Needs Prefixing Registers

About

Large pretrained vision encoders are central to multimodal intelligence, powering applications from on-device vision processing to vision-language models. Since these applications often demand real-time processing of massive visual data, reducing the inference cost of vision encoders is critical. Quantization offers a practical path, but it remains challenging even at 8-bit precision due to so-called outliers. In this work, we propose RegCache, a training-free algorithm that mitigates outliers in large-scale pretrained vision encoders and serves as a plug-in module that can be applied on top of other quantization methods. RegCache introduces outlier-prone yet semantically meaningless prefix tokens to the vision encoder, which prevent other tokens from having outliers. Notably, we observe that outliers in vision encoders behave differently from those in language models, motivating two technical innovations: middle-layer prefixing and token deletion. Experimental results show that our method consistently improves quantized model performance across various vision encoders, particularly in extremely low-bit regimes (e.g., 4-bit).
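As a rough illustration of the motivation (not the paper's implementation), the sketch below shows why a single activation outlier degrades symmetric per-tensor int8 quantization: the outlier stretches the quantization range, so all other values are rounded coarsely. If an outlier-prone prefix token absorbed that value instead, the remaining tokens would quantize over a much tighter range. All values here are hypothetical.

```python
# Minimal sketch, assuming symmetric per-tensor int8 quantization.
# Not the RegCache algorithm itself; it only illustrates the outlier problem.

def quantize_int8(xs):
    """Quantize to int8 with scale = max|x| / 127, then dequantize."""
    scale = max(abs(x) for x in xs) / 127 or 1.0  # guard against all-zero input
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return [v * scale for v in q]  # dequantized values

def mean_abs_error(xs, ys):
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

# "Normal" activations plus one outlier token (hypothetical values).
normal = [0.8, -0.5, 0.3, -0.9, 0.6, -0.2]
with_outlier = normal + [60.0]  # the outlier stretches the quantization range

# Error on the normal values when the outlier shares their quantization range:
err_outlier = mean_abs_error(normal, quantize_int8(with_outlier)[:len(normal)])

# Error when the normal values are quantized on their own (as if a prefix
# register had absorbed the outlier):
err_clean = mean_abs_error(normal, quantize_int8(normal))

print(err_outlier > 10 * err_clean)  # the outlier inflates error by >10x
```

The gap widens further at 4-bit precision, where only 16 quantization levels must cover the stretched range, which is consistent with the abstract's observation that low-bit regimes benefit most.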

Seunghyeon Kim, Taesun Yeom, Jinho Kim, Wonpyo Park, Kyuyeun Kim, Jaeho Lee • 2025

Related benchmarks

Task                       Dataset        Metric     Result   Rank
Visual Question Answering  VQA v2         Accuracy   58.83    1362
Image Classification       Food-101       Accuracy   88.17    542
Image Classification       DTD            Accuracy   63.62    542
Image Classification       StanfordCars   Accuracy   89.73    312
Image Classification       Caltech101     Accuracy   92.65    228
Video Understanding        MLVU           Score      45.4     221
Classification             ImageNet1K     Accuracy   82.92    202
Image Classification       Flowers-102    Top-1 Acc  80.32    198
Text-to-Image Retrieval    MS-COCO        R@1        52.5     151
Image-to-Text Retrieval    MS-COCO        R@1        71.54    132
Showing 10 of 20 rows
