
Efficient Large Multi-modal Models via Visual Context Compression

About

While significant advances have been made in compressing text embeddings for large language models (LLMs), the compression of visual tokens in multi-modal LLMs (MLLMs) has remained largely overlooked. In this work, we study the redundancy of visual tokens and efficient training in these models. Our initial experiments show that eliminating up to 70% of visual tokens at test time via simple average pooling reduces visual question answering accuracy on the GQA benchmark by only 3%, indicating significant redundancy in the visual context. To address this, we introduce the Visual Context Compressor, which reduces the number of visual tokens to improve training and inference efficiency without sacrificing performance. To minimize the information loss caused by compressing visual tokens while maintaining training efficiency, we develop LLaVolta, a lightweight, staged training scheme that progressively relaxes the visual context compression from heavy to light over the course of training, incurring no information loss at test time. Extensive experiments demonstrate that our approach improves MLLM performance in both image-language and video-language understanding while significantly cutting training costs and improving inference efficiency.
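The redundancy probe described above, average-pooling visual tokens before they enter the LLM, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the token count (576, a LLaVA-style value), the hidden size, the strides, and the `stride_for_stage` helper are assumptions made for the example.

```python
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, stride: int) -> np.ndarray:
    """Average-pool a (num_tokens, dim) array along the token axis,
    shrinking the token count by a factor of `stride`."""
    n, d = tokens.shape
    n_keep = n // stride  # drop any remainder tokens for simplicity
    return tokens[: n_keep * stride].reshape(n_keep, stride, d).mean(axis=1)

# e.g. 576 patch tokens of hidden size 4096, pooled with stride 4:
# 576 -> 144 tokens, a 75% reduction in visual context length
tokens = np.random.randn(576, 4096).astype(np.float32)
compressed = compress_visual_tokens(tokens, stride=4)
print(compressed.shape)  # -> (144, 4096)

# A hypothetical staged schedule in the spirit of LLaVolta: heavy
# compression in early training stages, relaxed to none by the final
# stage so inference sees uncompressed tokens. Strides are illustrative,
# not the paper's exact values.
def stride_for_stage(stage: int) -> int:
    return {0: 8, 1: 4, 2: 1}[stage]  # stride 1 == no compression
```

Because the final training stage uses stride 1, the model is ultimately trained on the full visual context, which is why testing loses no information.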

Jieneng Chen, Luoxin Ye, Ju He, Zhao-Yang Wang, Daniel Khashabi, Alan Yuille • 2024

Related benchmarks

Task                              Dataset            Metric        Result  Rank
Visual Question Answering         VQA v2             Accuracy      79.7    1165
Visual Question Answering         TextVQA            Accuracy      58.7    1117
Visual Question Answering         VizWiz             Accuracy      53.8    1043
Visual Question Answering         GQA                Accuracy      63      963
Object Hallucination Evaluation   POPE               Accuracy      86.8    935
Multimodal Evaluation             MME                Score         1520    557
Video Question Answering          MSRVTT-QA (test)   Accuracy      57.2    371
Multimodal Understanding          MMBench            --            --      367
Multimodal Capability Evaluation  MM-Vet             Score         31.9    282
Multimodal Reasoning              MM-Vet             MM-Vet Score  35.4    281

Showing 10 of 29 rows.
