
CASP: Compression of Large Multimodal Models Based on Attention Sparsity

About

In this work, we propose an extreme compression technique for Large Multimodal Models (LMMs). While previous studies have explored quantization as an efficient post-training compression method for Large Language Models (LLMs), low-bit compression for multimodal models remains under-explored. The redundant nature of inputs in multimodal models results in a highly sparse attention matrix. We theoretically and experimentally demonstrate that the attention matrix's sparsity bounds the compression error of the Query and Key weight matrices. Based on this, we introduce CASP, a model compression technique for LMMs. Our approach performs a data-aware low-rank decomposition on the Query and Key weight matrices, followed by quantization across all layers based on an optimal bit allocation process. CASP is compatible with any quantization technique and enhances state-of-the-art 2-bit quantization methods (AQLM and QuIP#) by an average of 21% on image- and video-language benchmarks.
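The core idea of a data-aware low-rank decomposition can be sketched as follows: instead of truncating the SVD of the raw weight matrix, the weights are whitened by calibration activations first, so the retained singular directions minimize the output error rather than the weight error. This is an illustrative sketch only; the function name, the Cholesky-based whitening, and all shapes are assumptions, not CASP's actual implementation.

```python
import numpy as np

def data_aware_low_rank(W, X, rank):
    """Illustrative data-aware low-rank approximation of a weight matrix W.

    Minimizes || X @ W - X @ W_r ||_F over rank-`rank` matrices W_r, where
    X holds calibration activations (one row per sample). This is a generic
    activation-whitened SVD, not CASP's exact procedure.
    """
    # Gram matrix of the calibration activations (small ridge for stability)
    G = X.T @ X + 1e-6 * np.eye(X.shape[1])
    S = np.linalg.cholesky(G)                      # G = S @ S.T
    # Truncated SVD in the whitened space S.T @ W
    U, s, Vt = np.linalg.svd(S.T @ W, full_matrices=False)
    M_r = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    # Undo the whitening to get the rank-`rank` weight approximation
    return np.linalg.solve(S.T, M_r)

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 64))   # hypothetical calibration activations
W = rng.standard_normal((64, 64))    # e.g. a Query projection matrix
W_r = data_aware_low_rank(W, X, rank=16)
err = np.linalg.norm(X @ W - X @ W_r) / np.linalg.norm(X @ W)
print(f"relative output error at rank 16: {err:.3f}")
```

In practice the rank (and the per-layer bit widths used in the subsequent quantization step) would be chosen jointly under a compression budget; the sketch above only shows the decomposition half of the pipeline.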

Mohsen Gholami, Mohammad Akbari, Kevin Cannons, Yong Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Modeling | WikiText2 | Perplexity | 8.1 | 1875 |
| Language Modeling | C4 | Perplexity | 10.54 | 1182 |
| Video Understanding | VideoMME | -- | -- | 192 |
| Multimodal Understanding | MME | -- | -- | 158 |
| Image Captioning | Flickr30K | CIDEr | 77.2 | 111 |
| Image Captioning | NoCaps | CIDEr | 102.1 | 101 |
| Vision-Language Modeling | LiveB | Perplexity | 5.69 | 28 |
| Vision-Language Modeling | LWilder | Perplexity | 4.51 | 28 |
| Image Captioning | COCO 17 | CIDEr | 107 | 23 |
| Image-Language Understanding | SQA | EM | 71.2 | 21 |

Showing 10 of 18 rows.
