LinVT: Empower Your Image-level Large Language Model to Understand Videos
About
Large Language Models (LLMs) have been widely used across many tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms any well-trained image-based LLM into a video-LLM (after training on video data). To better adapt image-LLMs to video, we introduce two design principles: a linear transformation that preserves the original visual-language alignment, and condensation of representative information from redundant video content. Guided by these principles, we propose the plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing its high compatibility. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, demonstrating the effectiveness of LinVT in multi-modal video understanding.
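The two design principles above can be illustrated with a minimal numpy sketch. This is a hypothetical simplification, not the actual LinVT implementation: the function name, the norm-based importance score, and the shapes are all assumptions for illustration. It shows (1) a purely linear projection of frame tokens, so no nonlinearity disturbs the image-LLM's visual-language alignment, and (2) condensing the redundant pool of per-frame tokens down to a small set of representative video tokens.

```python
import numpy as np

def linear_video_tokenizer(frame_tokens, W, num_video_tokens):
    """Hypothetical sketch of the two LinVT design principles.

    frame_tokens: (T, N, D) array -- T frames, N visual tokens per frame,
                  D-dim features from the frozen image encoder.
    W:            (D, D) projection matrix (linear only: no bias,
                  no activation), preserving visual-language alignment.
    num_video_tokens: K, the number of condensed video tokens to keep.
    """
    T, N, D = frame_tokens.shape
    flat = frame_tokens.reshape(T * N, D)   # pool all frame tokens
    projected = flat @ W                    # linear transformation only

    # Condensation: score each token (here, by feature norm -- an assumed
    # stand-in for a learned importance score) and keep the top-K.
    scores = np.linalg.norm(projected, axis=1)
    top_k = np.argsort(scores)[-num_video_tokens:]
    return projected[top_k]                 # (K, D) representative tokens

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 16, 32))   # 8 frames, 16 tokens, dim 32
W = rng.standard_normal((32, 32)) * 0.1
video_tokens = linear_video_tokenizer(tokens, W, num_video_tokens=10)
print(video_tokens.shape)  # (10, 32)
```

Because every operation on the tokens is linear, the condensed video tokens stay in the same feature space the image-LLM was aligned to, which is why the module can be dropped in front of an existing image-LLM without retraining the alignment from scratch.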
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 69.3 | 247 |
| Video Question Answering | EgoSchema (Full) | Accuracy | 69.5 | 193 |
| Video Question Answering | NEXT-QA | Overall Accuracy | 85.5 | 105 |
| Long Video Understanding | MLVU | Accuracy | 68.9 | 72 |
| Video Question Answering | MSVD-QA zero-shot (test) | Accuracy | 80.2 | 56 |
| Video Question Answering | ActivityNet-QA zero-shot (test) | Accuracy | 60.1 | 55 |
| Video Question Answering | MSRVTT-QA zero-shot (test) | Accuracy | 66.2 | 55 |
| Temporal Video Understanding | TempCompass | Average Score | 65.8 | 52 |
| Video Understanding | EgoSchema | Accuracy | 69.5 | 49 |
| Long Video Understanding | Video-MME long 1.0 | Accuracy (No Subs) | 63.1 | 45 |