
Progressive Video Condensation with MLLM Agent for Long-form Video Understanding

About

Understanding long videos requires extracting query-relevant information from long sequences under tight compute budgets. Existing text-then-LLM pipelines lose fine-grained visual cues, while video-based multimodal large language models (MLLMs) can preserve visual details but are frame-hungry and computationally expensive. In this work, we aim to harness MLLMs for efficient video understanding. We propose ProVCA, a progressive video condensation agent that iteratively locates key video frames at multiple granularities. ProVCA first adopts a segment localization module to identify the video segment relevant to the query, then a snippet selection module to select important snippets based on similarity, and finally a keyframe refinement module to pinpoint specific keyframes within those snippets. By progressively narrowing the scope from coarse segments to fine frames, ProVCA identifies a small set of keyframes for MLLM-based reasoning. ProVCA achieves state-of-the-art zero-shot accuracies of 69.3% on EgoSchema, 80.5% on NExT-QA, and 77.7% on IntentQA, while using fewer frames than previous training-free methods.
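The coarse-to-fine condensation described above can be illustrated with a minimal sketch. This is not ProVCA's actual implementation (the paper's modules involve an MLLM agent); it is a toy version assuming per-frame feature vectors and a query embedding are already available, with all function names, stage splits, and hyperparameters chosen for illustration:

```python
import numpy as np

def cosine_sim(query, frames):
    # Cosine similarity between one query vector and each row of a frame-feature matrix.
    q = query / np.linalg.norm(query)
    f = frames / np.linalg.norm(frames, axis=-1, keepdims=True)
    return f @ q

def progressive_condense(frame_feats, query_feat,
                         n_segments=8, snippets_per_segment=4,
                         top_snippets=2, n_keyframes=8):
    """Toy three-stage narrowing: segment -> snippets -> keyframes.
    Stage boundaries are uniform splits here; ProVCA's modules are learned/agentic."""
    T = len(frame_feats)
    # Stage 1 (segment localization): score equal-length segments by mean similarity.
    segments = np.array_split(np.arange(T), n_segments)
    seg_scores = [cosine_sim(query_feat, frame_feats[idx]).mean() for idx in segments]
    best_segment = segments[int(np.argmax(seg_scores))]
    # Stage 2 (snippet selection): split the chosen segment, keep the top snippets.
    snippets = np.array_split(best_segment, snippets_per_segment)
    snip_scores = [cosine_sim(query_feat, frame_feats[idx]).mean() for idx in snippets]
    keep_idx = sorted(np.argsort(snip_scores)[-top_snippets:])
    candidates = np.concatenate([snippets[i] for i in keep_idx])
    # Stage 3 (keyframe refinement): pick the individual frames most similar to the query.
    frame_scores = cosine_sim(query_feat, frame_feats[candidates])
    keyframes = candidates[np.argsort(frame_scores)[-n_keyframes:]]
    return np.sort(keyframes)
```

The design point the sketch captures is that each stage scores only the survivors of the previous one, so the expensive per-frame reasoning (an MLLM in the real system, cosine similarity here) is applied to a shrinking candidate set rather than the full video.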

Yufei Yin, Yuchen Xing, Qianke Meng, Minghao Chen, Yan Yang, Zhou Yu • 2026

Related benchmarks

Task                      Dataset            Metric          Result  Rank
Video Question Answering  EgoSchema (Full)   Accuracy        69.3    221
Video Question Answering  EgoSchema Subset   Accuracy        74.2    114
Video Question Answering  IntentQA           Accuracy (All)  77.7    35
