
Breaking the Encoder Barrier for Seamless Video-Language Understanding

About

Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient solution for real-time video understanding.
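As a rough illustration of how token merging can build a bottom-up hierarchy, the sketch below averages neighbouring tokens level by level. It is not ELVA's actual merging rule, which the abstract does not specify: the function names `merge_adjacent` and `build_hierarchy`, the fixed window of 2, and the use of plain averaging are all assumptions made for this example.

```python
from typing import List

Token = List[float]  # one token = one feature vector


def merge_adjacent(tokens: List[Token], window: int = 2) -> List[Token]:
    """Average each run of `window` consecutive tokens into a single token.

    Assumption: simple mean pooling stands in for whatever merging rule
    the real model uses.
    """
    if window < 1:
        raise ValueError("window must be >= 1")
    merged: List[Token] = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        merged.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return merged


def build_hierarchy(tokens: List[Token], levels: int, window: int = 2) -> List[List[Token]]:
    """Return [level 0, level 1, ...], where each level is a merged copy of the previous one."""
    pyramid = [tokens]
    for _ in range(levels):
        if len(pyramid[-1]) <= 1:
            break  # nothing left to merge
        pyramid.append(merge_adjacent(pyramid[-1], window))
    return pyramid


if __name__ == "__main__":
    # Four toy "patch tokens" with two features each.
    patches = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    for depth, level in enumerate(build_hierarchy(patches, levels=3)):
        print(depth, len(level), level)
```

Each level halves the token count, so later layers of a model would attend over far fewer tokens than the raw input. That reduction is the general reason token merging lowers compute, though the savings reported above depend on details this sketch does not model.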

Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu • 2025

Related benchmarks

Task                      Dataset    Result          Rank
Video Understanding       MVBench    Accuracy 51.2   563
Long Video Understanding  MLVU       Accuracy 51.8   205
Video Understanding       VideoMME   Accuracy 47.1   30
