
UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers

About

Real-world data contains a vast amount of multimodal information, among which vision and language are the two most representative modalities. Moreover, increasingly heavy models, e.g., Transformers, have drawn researchers' attention to model compression. However, how to compress multimodal models, especially vision-language Transformers, remains under-explored. This paper proposes Unified and Progressive Pruning (UPop) as a universal vision-language Transformer compression framework, which incorporates 1) unified search for multimodal subnets in a continuous optimization space over the original model, enabling automatic assignment of pruning ratios among compressible modalities and structures; and 2) progressive search and retraining of the subnet, which maintains convergence between search and retraining to attain higher compression ratios. Experiments on various tasks, datasets, and model architectures demonstrate the effectiveness and versatility of the proposed UPop framework. The code is available at https://github.com/sdc17/UPop.
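The core idea of progressive pruning can be illustrated with a toy sketch: rather than removing all structures at once, a pruning mask is tightened in small steps, with retraining in between, so the network stays close to convergence. The sketch below is an illustration of that schedule, not the authors' implementation; the function name and the use of a single `importance` array (standing in for the learned continuous mask values, jointly across all structures, which is what makes the ratio assignment "unified") are assumptions for the example.

```python
import numpy as np

def upop_style_progressive_prune(importance, target_ratio, steps):
    """Toy sketch of progressive pruning (not the authors' code).

    `importance` stands in for the learned continuous mask values over
    all compressible structures. The fraction of pruned entries grows
    linearly toward `target_ratio` over `steps` rounds, leaving room
    for retraining between rounds.
    """
    mask = np.ones_like(importance)
    n = importance.size
    for step in range(1, steps + 1):
        # Fraction pruned grows linearly toward the final target ratio.
        ratio = target_ratio * step / steps
        k = int(round(n * ratio))
        if k > 0:
            # Zero the k least-important entries, ranked jointly across
            # all structures: per-structure pruning ratios fall out
            # automatically instead of being hand-assigned.
            idx = np.argsort(importance, axis=None)[:k]
            mask = np.ones_like(importance)
            mask.flat[idx] = 0.0
        # ...retraining with `mask` applied would happen here...
    return mask
```

For example, pruning six entries to a 0.5 target ratio over three steps removes the two, then... cumulatively one, two, and finally three of the lowest-importance entries, rather than all three at once.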

Dachuan Shi, Chaofan Tao, Ying Jin, Zhendong Yang, Chun Yuan, Jiaqi Wang• 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Visual Question Answering | VQA v2 (test-std) | Accuracy | 76.3 | 466
Text-to-Image Retrieval | Flickr30K | R@1 | 82 | 460
Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 94 | 439
Image-to-Text Retrieval | Flickr30K | R@1 | 94 | 379
Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 82 | 375
Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 76.3 | 337
Text-to-Image Retrieval | MS-COCO 5K (test) | R@1 | 59.8 | 223
Text-to-Image Retrieval | COCO | Recall@1 | 59.8 | 130
Image-to-Text Retrieval | COCO | R@1 | 77.4 | 123
Visual Reasoning | NLVR2 (test) | Accuracy | 81.13 | 44

(Showing 10 of 15 rows)
