VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation

About

A well-known dilemma in large vision-language models (e.g., GPT-4, LLaVA) is that while increasing the number of vision tokens generally enhances visual understanding, it also significantly raises memory and computational costs, especially in long-term, dense video frame streaming scenarios. Although learnable approaches like Q-Former and Perceiver Resampler have been developed to reduce the vision token burden, they overlook the context causally modeled by LLMs (i.e., the key-value cache), potentially leading to missed visual cues when addressing user queries. In this paper, we introduce a novel approach that reduces vision compute by letting redundant vision tokens "skip layers" rather than by decreasing the number of vision tokens. Our method, VideoLLM-MoD, is inspired by mixture-of-depths LLMs and addresses the challenge of numerous vision tokens in long-term or streaming video. Specifically, for each transformer layer, we learn to skip the computation for a high proportion (e.g., 80%) of vision tokens, passing them directly to the next layer. This approach significantly enhances model efficiency, achieving approximately 42% time and 30% memory savings over the entire training. Moreover, our method reduces the computation in the context while avoiding any decrease in the number of vision tokens, thus preserving or even improving performance compared to the vanilla model. We conduct extensive experiments to demonstrate the effectiveness of VideoLLM-MoD, showing state-of-the-art results on multiple benchmarks, including narration, forecasting, and summarization tasks on the COIN, Ego4D, and Ego-Exo4D datasets.

Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou • 2024
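
The per-layer skipping described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the linear router, the top-k selection rule, and the names `select_vision_tokens`, `mod_layer`, `router_w`, and `toy_layer` are assumptions made for illustration, loosely following the general mixture-of-depths idea. Text tokens are always processed; only a fraction of vision tokens (here 20%, i.e., 80% skipped) goes through the layer, and the rest are passed on unchanged.

```python
import math

def select_vision_tokens(hidden, is_vision, router_w, keep_ratio=0.2):
    """Return sorted indices of the tokens this layer will process.

    Text tokens are always kept. Vision tokens are ranked by a
    hypothetical linear router score, and only the top `keep_ratio`
    fraction is kept (assumption: top-k routing).
    """
    vision_idx = [i for i, v in enumerate(is_vision) if v]
    text_idx = [i for i, v in enumerate(is_vision) if not v]
    # Router score: dot product between the token vector and router_w.
    scores = {i: sum(w * x for w, x in zip(router_w, hidden[i]))
              for i in vision_idx}
    k = math.ceil(keep_ratio * len(vision_idx))
    top = sorted(vision_idx, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(text_idx + top)

def mod_layer(hidden, is_vision, router_w, layer_fn, keep_ratio=0.2):
    """Apply layer_fn to the selected tokens; skipped tokens pass through."""
    keep = select_vision_tokens(hidden, is_vision, router_w, keep_ratio)
    # layer_fn sees only the kept tokens and returns one update per token.
    updates = layer_fn([hidden[i] for i in keep])
    out = [list(h) for h in hidden]  # skipped tokens are copied unchanged
    for i, delta in zip(keep, updates):
        out[i] = [x + d for x, d in zip(hidden[i], delta)]  # residual update
    return out, keep

# Toy input: five vision tokens followed by two text tokens.
hidden = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [2.0, 0.0], [0.1, 0.1],
          [0.3, 0.3], [1.0, 1.0]]
is_vision = [True, True, True, True, True, False, False]
router_w = [1.0, 0.0]  # hypothetical router weights

def toy_layer(tokens):
    # Stand-in for a transformer block: adds 1.0 to every feature.
    return [[1.0 for _ in t] for t in tokens]

out, kept = mod_layer(hidden, is_vision, router_w, toy_layer, keep_ratio=0.2)
print(kept)
```

In this toy setting the layer processes three of the seven tokens (one vision token plus the two text tokens), while every token remains in the sequence for later layers. That is the distinction the abstract draws from approaches that reduce the number of vision tokens.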

Related benchmarks

Task	Dataset	Metric	Result
Step Forecasting	COIN	--	--	26
Task recognition	COIN	Accuracy	92.8	22
Long Term Anticipation	Ego4D LTA v1 (test)	ED@Z=20 Verb	0.689	18
Step Recognition	COIN	Top-1 Accuracy	63.4	12
Fine-grained Keystep Recognition	EgoExo4D v1 (val)	Ego Accuracy	44.85	11
Fine-grained Keystep Recognition	EgoExo4D v2 (val)	Ego Accuracy	42.62	11
Instructional Video Understanding	COIN (test)	Step Recognition Top-1 Acc	63.4	10
Streaming Narration	Ego4D	Perplexity (PPL)	2.41	6
Streaming Narration	Ego4D v1 (test)	MACs	9.64	4
Streaming Narration	EgoExo4D v1 (test)	MACs	13.21	4

Showing 10 of 11 rows
