
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

About

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, remains under-explored. This paper proposes Unified and Progressive Pruning (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unified search for multimodal subnets in a continuous optimization space over the original model, enabling automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressive search and retraining of the subnet, which maintains convergence between search and retraining to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.
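The core idea of progressive pruning can be illustrated with a toy sketch: rather than removing all structures at once, a pruning mask is tightened in small steps, with retraining in between, so the network stays close to convergence. The sketch below is an illustration of that schedule, not the authors' implementation; the function name and the use of a single `importance` array (standing in for the learned continuous mask values, jointly across all structures, which is what makes the ratio assignment "unified") are assumptions for the example.

```python
import numpy as np

def upop_style_progressive_prune(importance, target_ratio, steps):
    """Toy sketch of progressive pruning (not the authors' code).

    `importance` stands in for the learned continuous mask values over
    all compressible structures. The fraction of pruned entries grows
    linearly toward `target_ratio` over `steps` rounds, leaving room
    for retraining between rounds.
    """
    mask = np.ones_like(importance)
    n = importance.size
    for step in range(1, steps + 1):
        # Fraction pruned grows linearly toward the final target ratio.
        ratio = target_ratio * step / steps
        k = int(round(n * ratio))
        if k > 0:
            # Zero the k least-important entries, ranked jointly across
            # all structures: per-structure pruning ratios fall out
            # automatically instead of being hand-assigned.
            idx = np.argsort(importance, axis=None)[:k]
            mask = np.ones_like(importance)
            mask.flat[idx] = 0.0
        # ...retraining with `mask` applied would happen here...
    return mask
```

For example, pruning six entries to a 0.5 target ratio over three steps removes the two, then... cumulatively one, two, and finally three of the lowest-importance entries, rather than all three at once.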

Dachuan Shi, Chaofan Tao, Ying Jin, Zhendong Yang, Chun Yuan, Jiaqi Wang• 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | VQA v2 (test-std) | Accuracy | 76.3 | 466
Text-to-Image Retrieval | Flickr30K | R@1 | 82 | 460
Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 94 | 439
Image-to-Text Retrieval | Flickr30K | R@1 | 94 | 379
Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 82 | 375
Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 76.3 | 337
Text-to-Image Retrieval | MS-COCO 5K (test) | R@1 | 59.8 | 223
Text-to-Image Retrieval | COCO | Recall@1 | 59.8 | 130
Image-to-Text Retrieval | COCO | R@1 | 77.4 | 123
Visual Reasoning | NLVR2 (test) | Accuracy | 81.13 | 44

(Showing 10 of 15 rows)
