
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

About

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, is still under-explored. This paper proposes Unified and Progressive Pruning (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unified search of multimodal subnets in a continuous optimization space from the original model, which enables automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressive search and retraining of the subnet, which maintains convergence between search and retraining to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.
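To make the two ideas in the abstract concrete, the following is a minimal, hypothetical sketch (not the authors' implementation; all names and the schedule are assumptions for illustration). It pools learned importance scores from every compressible structure, across both modalities, into a single ranking, so the pruning ratio per modality falls out automatically, and it removes units progressively over several steps rather than all at once, leaving room for retraining between steps.

```python
import numpy as np

def progressive_unified_prune(scores, target_ratio, steps):
    """Progressively zero out the least important units until the
    target pruning ratio is reached.

    scores: dict mapping each compressible structure (from any
    modality) to a 1-D array of learned importance scores. Pooling
    all structures into one ranking lets the pruning budget be
    assigned across modalities automatically (a simplified stand-in
    for UPop's unified search in a continuous space).
    """
    # Flatten all scores into one pool so modalities compete directly.
    pool = np.concatenate(list(scores.values()))
    n_total = pool.size
    n_prune_total = int(round(n_total * target_ratio))

    mask = np.ones(n_total)          # 1 = kept, 0 = pruned
    order = np.argsort(pool)         # least important first
    for step in range(1, steps + 1):
        # Progressive schedule: prune a growing fraction each step.
        # In the real framework the subnet is retrained between steps;
        # retraining is omitted in this sketch.
        n_prune = int(round(n_prune_total * step / steps))
        mask[:] = 1
        mask[order[:n_prune]] = 0

    # Split the final unified mask back out per structure.
    out, i = {}, 0
    for name, v in scores.items():
        out[name] = mask[i:i + v.size]
        i += v.size
    return out
```

For example, with 12 vision attention heads, 12 text attention heads, and 24 FFN dimensions and a 50% target ratio, the returned masks keep 24 of the 48 units in total, with the split between modalities decided purely by the scores.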

Dachuan Shi, Chaofan Tao, Ying Jin, Zhendong Yang, Chun Yuan, Jiaqi Wang · 2023

Related benchmarks

Task                        Dataset               Metric            Result  Rank
Visual Question Answering   VQA v2 (test-dev)     Overall Accuracy  74.5    706
Text-to-Image Retrieval     Flickr30K             R@1               82      531
Image-to-Text Retrieval     Flickr30K 1K (test)   R@1               94      491
Visual Question Answering   VQA v2 (test-std)     Accuracy          76.3    486
Text-to-Image Retrieval     Flickr30K 1K (test)   R@1               82      432
Image-to-Text Retrieval     Flickr30K             R@1               94      429
Visual Question Answering   VQA 2.0 (test-dev)    Accuracy          76.3    337
Text-to-Image Retrieval     MS-COCO 5K (test)     R@1               59.8    244
Text-to-Image Retrieval     COCO                  Recall@1          59.8    156
Image-to-Text Retrieval     COCO                  R@1               77.4    149

(Showing 10 of 23 rows)
