Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

About

State-of-the-art transformer-based large multimodal models (LMMs) struggle to handle hour-long video inputs due to the quadratic complexity of the causal self-attention operations, leading to high computational costs during training and inference. Existing token compression-based methods reduce the number of video tokens but often incur information loss and remain inefficient for extremely long sequences. In this paper, we explore an orthogonal direction to build a hybrid Mamba-Transformer model (VAMBA) that employs Mamba-2 blocks to encode video tokens with linear complexity. Without any token reduction, VAMBA can encode more than 1024 frames (640$\times$360) on a single GPU, while transformer-based models can only encode 256 frames. On long video input, VAMBA achieves at least 50% reduction in GPU memory usage during training and inference, and nearly doubles the speed per training step compared to transformer-based LMMs. Our experimental results demonstrate that VAMBA improves accuracy by 4.3% on the challenging hour-long video understanding benchmark LVBench over prior efficient video LMMs, and maintains strong performance on a broad spectrum of long and short video understanding tasks.

Weiming Ren, Wentao Ma, Huan Yang, Cong Wei, Ge Zhang, Wenhu Chen• 2025

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LVBench	Accuracy42.1	267
Long Video Understanding	MLVU	Accuracy65.9	265
Long-form Video Understanding	LongVideoBench	Accuracy55.9	135
Long Video Understanding	VideoMME	Accuracy57.8	97
Long Video Understanding	Video-MME	Overall Score57.8	90
Long Video Understanding	Video-MME Overall	Accuracy57.8	81
Long Video Understanding	MLVU (dev)	--	63
Long Video Understanding	LongVideoBench	Average Performance55.9	24
Video Understanding	VideoMME w/o sub. Overall	Overall Accuracy57.8	22
Video Multimodal Evaluation	Video-MME	--	18

Showing 10 of 13 rows

Other info

Follow for update

@wizwand_team Discord