Breaking the Encoder Barrier for Seamless Video-Language Understanding

About

Most Video-Large Language Models (Video-LLMs) adopt an encoder-decoder framework, where a vision encoder extracts frame-wise features for processing by a language model. However, this approach incurs high computational costs, introduces resolution biases, and struggles to capture fine-grained multimodal interactions. To overcome these limitations, we propose ELVA, an encoder-free Video-LLM that directly models nuanced video-language interactions without relying on a vision encoder. ELVA employs token merging to construct a bottom-up hierarchical representation and incorporates a video guidance supervisor for direct spatiotemporal representation learning. Additionally, a hybrid-resolution mechanism strategically integrates high- and low-resolution frames as inputs to achieve an optimal balance between performance and efficiency. With only 7M publicly available video-text pairs, ELVA achieves performance on par with encoder-based Video-LLMs while reducing FLOPs by up to 95% and inference latency by 92%, offering a scalable and efficient solution for real-time video understanding.
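The abstract does not spell out how token merging works, so the following is only a minimal sketch of the general idea, not ELVA's actual implementation. It assumes tokens are plain feature vectors, that neighbouring tokens are merged by running average when their cosine similarity reaches a threshold, and that repeating the pass yields progressively coarser levels. The function names (`cosine`, `merge_adjacent`, `build_hierarchy`) and the `threshold` value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_adjacent(tokens, threshold=0.9):
    """One merging pass (assumed rule): fold a token into the previous group
    when it is similar enough to that group's running mean."""
    groups = []  # each entry: (mean_vector, number_of_tokens_merged)
    for tok in tokens:
        if groups and cosine(groups[-1][0], tok) >= threshold:
            mean, n = groups[-1]
            # Update the group's running mean with the new token.
            new_mean = [(m * n + t) / (n + 1) for m, t in zip(mean, tok)]
            groups[-1] = (new_mean, n + 1)
        else:
            groups.append((list(tok), 1))
    return [mean for mean, _ in groups]

def build_hierarchy(tokens, levels=2, threshold=0.9):
    """Apply the merging pass repeatedly; returns token lists from finest to coarsest."""
    pyramid = [[list(t) for t in tokens]]
    for _ in range(levels):
        pyramid.append(merge_adjacent(pyramid[-1], threshold))
    return pyramid
```

For example, `merge_adjacent([[1, 0], [1, 0], [0, 1], [0, 1], [0, 1]])` collapses the five toy tokens into two, one per run of identical vectors. Fewer tokens reaching the language model is one plausible source of the compute savings the abstract reports, though the paper itself should be consulted for the actual mechanism.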

Handong Li, Yiyuan Zhang, Longteng Guo, Xiangyu Yue, Jing Liu • 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy 51.2	635
Long Video Understanding	MLVU	Accuracy 51.8	265
Video Understanding	VideoMME	Accuracy 47.1	33

