
DeepStack: Deeply Stacking Visual Tokens is Surprisingly Simple and Effective for LMMs

About

Most large multimodal models (LMMs) are implemented by feeding visual tokens as a sequence into the first layer of a large language model (LLM). The resulting architecture is simple but significantly increases computation and memory costs, because it must handle a large number of additional tokens in its input layer. This paper presents DeepStack, a new architecture for LMMs. Considering the N layers of the language and vision transformers in an LMM, we stack the visual tokens into N groups and feed each group to its aligned transformer layer, from bottom to top. Surprisingly, this simple method greatly enhances the ability of LMMs to model interactions among visual tokens across layers, with minimal additional cost. We apply DeepStack to both the language and vision transformers in LMMs and validate its effectiveness with extensive empirical results. Using the same context length, our DeepStack 7B and 13B models surpass their counterparts by 2.7 and 2.9 points on average across 9 benchmarks, respectively. Using only one-fifth of the context length, DeepStack closely rivals counterparts that use the full context length. The gains are particularly pronounced on high-resolution tasks: improvements of 4.2, 11.0, and 4.0 points on TextVQA, DocVQA, and InfoVQA over LLaVA-1.5-7B, respectively. Applying DeepStack to the vision transformer layers brings a similar improvement, 3.8 points on average over LLaVA-1.5-7B.
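The stacking idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `dummy_layer`, the residual-injection scheme, and the array shapes are all assumptions. The point it demonstrates is that only the first of the N visual-token groups enters the input sequence, so context length grows by V/N instead of V, while the remaining groups are added onto the aligned visual positions at deeper layers.

```python
import numpy as np

def dummy_layer(hidden):
    # Stand-in for one transformer layer: any token-wise transform.
    # Identity here, so the injection arithmetic stays easy to follow.
    return hidden * 1.0

def deepstack_forward(text_tokens, visual_tokens, num_layers, layer_fn=dummy_layer):
    """Hypothetical sketch of DeepStack-style visual-token stacking.

    text_tokens:   (T, D) array of text embeddings
    visual_tokens: (V, D) array of visual embeddings, V divisible by num_layers
    """
    groups = np.split(visual_tokens, num_layers)  # N equal groups, bottom to top
    g = groups[0].shape[0]
    # Only the first group is prepended to the input sequence, so the
    # sequence length is T + V/N rather than T + V.
    hidden = np.concatenate([groups[0], text_tokens], axis=0)
    for i in range(num_layers):
        if i > 0:
            # Deeper groups are injected residually onto the aligned
            # visual positions before this layer runs.
            hidden[:g] += groups[i]
        hidden = layer_fn(hidden)
    return hidden
```

With the identity layer, the visual positions of the output are simply the sum of all groups and the text positions pass through unchanged, which makes the injection pattern easy to verify by hand.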

Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, Yu-Gang Jiang • 2024

Related benchmarks

Task                                 | Dataset               | Metric   | Result | Rank
Visual Question Answering            | VQA v2                | Accuracy | 83     | 1165
Visual Question Answering            | VizWiz                | Accuracy | 50.3   | 1043
Visual Question Answering            | GQA                   | Accuracy | 66.2   | 963
Object Hallucination Evaluation      | POPE                  | Accuracy | 87.7   | 935
Text-based Visual Question Answering | TextVQA               | Accuracy | 62.4   | 496
Multimodal Understanding             | MMBench               | --       | --     | 367
Video Question Answering             | MSVD-QA               | Accuracy | 76     | 340
Video Question Answering             | ActivityNet-QA        | Accuracy | 49.3   | 319
Video Question Answering             | ActivityNet-QA (test) | Accuracy | 49.3   | 275
Video Question Answering             | MSVD-QA (test)        | Accuracy | 76     | 274
Showing 10 of 48 rows
