LinVT: Empower Your Image-level Large Language Model to Understand Videos
About
Large Language Models (LLMs) have been widely used across many tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms any well-trained image-based LLM into a video-LLM (after training on video data). To better adapt image-LLMs to video, we introduce two design principles: a linear transformation that preserves the original visual-language alignment, and condensation of representative information from redundant video content. Guided by these principles, we propose the plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing its high compatibility. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, demonstrating the effectiveness of LinVT in multi-modal video understanding.
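The two design principles above can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the actual LinVT implementation: the function name, the norm-based importance score, and the shapes are all assumptions for illustration. It shows (1) a purely linear projection of frame tokens, so no nonlinearity disturbs the image-LLM's visual-language alignment, and (2) condensing the redundant pool of per-frame tokens down to a small set of representative video tokens.

```python
import numpy as np

def linear_video_tokenizer(frame_tokens, W, num_video_tokens):
    """Hypothetical sketch of the two LinVT design principles.

    frame_tokens: (T, N, D) array -- T frames, N visual tokens per frame,
                  D-dim features from the frozen image encoder.
    W:            (D, D) projection matrix (linear only: no bias,
                  no activation), preserving visual-language alignment.
    num_video_tokens: K, the number of condensed video tokens to keep.
    """
    T, N, D = frame_tokens.shape
    flat = frame_tokens.reshape(T * N, D)   # pool all frame tokens
    projected = flat @ W                    # linear transformation only

    # Condensation: score each token (here, by feature norm -- an assumed
    # stand-in for a learned importance score) and keep the top-K.
    scores = np.linalg.norm(projected, axis=1)
    top_k = np.argsort(scores)[-num_video_tokens:]
    return projected[top_k]                 # (K, D) representative tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16, 32))   # 8 frames, 16 tokens, dim 32
W = rng.standard_normal((32, 32)) * 0.1
video_tokens = linear_video_tokenizer(tokens, W, num_video_tokens=10)
print(video_tokens.shape)  # (10, 32)
```

Because every operation on the tokens is linear, the condensed video tokens stay in the same feature space the image-LLM was aligned to, which is why the module can be dropped in front of an existing image-LLM without retraining the alignment from scratch.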
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 69.3 | 247 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 69.5 | 193 |
| Video Question Answering | NEXT-QA | Overall Accuracy | 85.5 | 105 |
| Long Video Understanding | MLVU | Accuracy | 68.9 | 72 |
| Video Question Answering | MSVD-QA zero-shot (test) | Accuracy | 80.2 | 56 |
| Video Question Answering | ActivityNet-QA zero-shot (test) | Accuracy | 60.1 | 55 |
| Video Question Answering | MSRVTT-QA zero-shot (test) | Accuracy | 66.2 | 55 |
| Temporal Video Understanding | TempCompass | Average Score | 65.8 | 52 |
| Video Understanding | EgoSchema | Accuracy | 69.5 | 49 |
| Long Video Understanding | Video-MME long 1.0 | Accuracy (No Subs) | 63.1 | 45 |