MMViR: A Multi-Modal and Multi-Granularity Representation for Long-range Video Understanding
About
Long videos, ranging from minutes to hours, present significant challenges for current Multi-modal Large Language Models (MLLMs) due to their complex events, diverse scenes, and long-range dependencies. Directly encoding such videos is computationally prohibitive, while simple video-to-text conversion often yields redundant or fragmented content. To address these limitations, we introduce MMViR, a novel multi-modal, multi-granularity structured representation for long video understanding. MMViR identifies key turning points to segment the video and constructs a three-level description that couples global narratives with fine-grained visual details. This design supports efficient query-based retrieval and generalizes well across diverse scenarios. Extensive evaluations across three tasks, including QA, summarization, and retrieval, show that MMViR outperforms the strongest prior method, achieving a 19.67% improvement in hour-long video understanding while reducing processing latency to 45.4% of the original.
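The three-level structure described above can be sketched in plain Python. This is a hypothetical illustration, not the paper's implementation: the class and field names (`MMViRIndex`, `Segment`, `global_narrative`) are invented here, the captions would in practice come from an MLLM, and keyword overlap stands in for the actual query-based retrieval mechanism.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One segment between two turning points (hypothetical structure)."""
    start: float          # segment start time in seconds
    end: float            # segment end time in seconds
    scene_caption: str    # mid-level: per-segment narrative
    details: list         # fine-grained: visual details in this segment

@dataclass
class MMViRIndex:
    """Three-level representation: global narrative, scene captions, details."""
    global_narrative: str                 # top level: whole-video summary
    segments: list = field(default_factory=list)

    def retrieve(self, query: str, top_k: int = 1) -> list:
        """Rank segments by naive keyword overlap with the query
        (a stand-in for the paper's query-based retrieval)."""
        q = set(query.lower().split())

        def score(seg: Segment) -> int:
            words = set(
                " ".join([seg.scene_caption, *seg.details]).lower().split()
            )
            return len(q & words)

        return sorted(self.segments, key=score, reverse=True)[:top_k]

# Toy example with two hand-written segments.
index = MMViRIndex(
    global_narrative="A chef prepares dinner, then guests arrive.",
    segments=[
        Segment(0, 120, "chef chops vegetables in the kitchen",
                ["knife on cutting board", "steam from a pot"]),
        Segment(120, 300, "guests arrive and sit at the table",
                ["doorbell", "wine glasses on the table"]),
    ],
)
best = index.retrieve("who is in the kitchen")[0]
print(best.scene_caption)  # -> chef chops vegetables in the kitchen
```

The key design point this sketch mirrors is that a query only touches short per-segment text rather than the full video, which is where the latency savings reported above come from.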
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | VideoMME | Accuracy | 54.7 | 99 |
| Video Question Answering | EgoSchema | Accuracy | 65.4 | 88 |
| Video Question Answering | HourVideo | Accuracy | 35.3 | 11 |
| Video Summarization | HourVideo | ROUGE-2 | 10.67 | 3 |
| Video Summarization | MovieChat-1K | ROUGE-2 | 4.27 | 3 |