LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via a Hybrid Architecture

About

Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is critical for advancing video understanding and high-resolution image analysis. Achieving this requires systematic improvements in model architecture, data construction, and training strategies, particularly to address challenges such as performance degradation with increasing image counts and high computational costs. In this paper, we propose a hybrid architecture that integrates Mamba and Transformer blocks, introduce data construction methods that capture both temporal and spatial dependencies, and employ a progressive training strategy. Our released model, LongLLaVA (Long-Context Large Language and Vision Assistant), demonstrates an effective balance between efficiency and performance. LongLLaVA achieves competitive results across various benchmarks while maintaining high throughput and low memory consumption. Notably, it can process nearly one thousand images on a single A100 80GB GPU, underscoring its potential for a wide range of multi-modal applications.

Xidong Wang, Dingjie Song, Shunian Chen, Junyin Chen, Zhenyang Cai, Chen Zhang, Lichao Sun, Benyou Wang • 2024

Related benchmarks

Task	Dataset	Result
Video Question Answering	MVBench	Accuracy 49.1	90
General Video Understanding	Video-MME	Accuracy 52.9	82
Video Question Answering	VideoMME wo sub	Accuracy 43.7	51
Long Video Understanding	Video-MME (full)	Overall Performance 50.9	51
Long Video Understanding	Video-MME	--	48
Long Video Understanding	Video MME w/o sub (long)	Accuracy 46.4	30
Event Sequencing	VECTOR L2 (Ne=8) 1.0 (test)	EM (Exact Match) 267	26
Event Sequencing	VECTOR L1 (Ne=4) 1.0 (test)	EM Score 9.33	26
Long Video Understanding	Video-MME (w/o sub.) Overall 1010s	Accuracy 53.8	22
Long Video Event Prediction	MILES	Accuracy 38.61	18

Showing 10 of 17 rows
