HiMu: Hierarchical Multimodal Frame Selection for Long Video Question Answering

About

Long-form video question answering requires reasoning over extended temporal contexts, making frame selection a critical bottleneck for multimodal large language models (MLLMs) bound by finite context windows. Within the controlled frame-budget regime that governs practical deployment, prior selectors score frames against a single global query embedding. As a result, compositional multimodal questions that involve temporal ordering or cross-modal cues, such as "what happens on screen right after the narrator mentions the reaction?", are flattened into a representation that loses sub-event ordering and modality bindings. We introduce HiMu, a training-free framework for compositional multimodal frame selection. A single text-only LLM call decomposes the query into a hierarchical logic tree whose leaves are atomic predicates, each routed to a lightweight expert spanning vision (CLIP, open-vocabulary detection, OCR) and audio (speech recognition and non-speech sound matching). Expert signals are normalized, smoothed to align across modalities, and composed bottom-up through fuzzy-logic operators that enforce temporal sequencing and adjacency, yielding a continuous per-frame satisfaction curve. Under the standard 16-frame budget on Video-MME, LongVideoBench, and HERBench-Lite, HiMu achieves state-of-the-art accuracy among frame selection methods and, as a drop-in module, improves over uniform sampling across seven diverse MLLMs. It matches the accuracy of uniform sampling given four times the frame budget, without retraining and without multiple iterative MLLM calls during selection.
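To make the pipeline described above more concrete, here is a minimal sketch of bottom-up fuzzy-logic composition over per-frame expert scores, followed by top-K frame selection. Everything in it is an illustrative assumption and not the authors' implementation: the names `Leaf`, `Node`, `evaluate`, and `select_frames` are hypothetical, min-max normalization and a moving average stand in for the unspecified normalization and smoothing steps, min/max stand in for the fuzzy AND/OR operators, and the "then" operator is one simple way to encode temporal sequencing. The LLM decomposition and the experts themselves are replaced by hand-written score lists.

```python
from dataclasses import dataclass
from typing import List, Union

Curve = List[float]  # one score per frame


def normalize(scores: Curve) -> Curve:
    """Min-max normalize raw scores to [0, 1] (assumed scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def smooth(scores: Curve, radius: int) -> Curve:
    """Centered moving average; the window is truncated at the boundaries."""
    out = []
    for t in range(len(scores)):
        window = scores[max(0, t - radius): t + radius + 1]
        out.append(sum(window) / len(window))
    return out


@dataclass
class Leaf:
    scores: Curve  # raw per-frame output of one (hypothetical) expert


@dataclass
class Node:
    op: str  # "and", "or", or "then"
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Node]


def evaluate(tree: Tree, radius: int = 1) -> Curve:
    """Compose leaf curves bottom-up into one per-frame satisfaction curve."""
    if isinstance(tree, Leaf):
        return smooth(normalize(tree.scores), radius)
    a = evaluate(tree.left, radius)
    b = evaluate(tree.right, radius)
    if tree.op == "and":   # fuzzy conjunction (assumed: min)
        return [min(x, y) for x, y in zip(a, b)]
    if tree.op == "or":    # fuzzy disjunction (assumed: max)
        return [max(x, y) for x, y in zip(a, b)]
    if tree.op == "then":  # b holds now AND a held at some strictly earlier frame
        out, best_a = [], 0.0
        for x, y in zip(a, b):
            out.append(min(best_a, y))
            best_a = max(best_a, x)
        return out
    raise ValueError(f"unknown operator: {tree.op}")


def select_frames(curve: Curve, k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order."""
    ranked = sorted(range(len(curve)), key=lambda t: (-curve[t], t))
    return sorted(ranked[:k])


# Toy query: "what happens on screen right after the narrator mentions X?"
speech = Leaf([0, 0, 1, 0, 0, 0, 0, 0])  # narrator mentions X at frame 2
visual = Leaf([1, 0, 0, 0, 0, 1, 0, 0])  # visual event at frames 0 and 5
query = Node("then", speech, visual)

# radius=0 disables smoothing so the result is easy to check by hand.
print(select_frames(evaluate(query, radius=0), k=1))  # prints [5]
```

In this toy example the visual event at frame 0 is rejected because it precedes the spoken mention, which is the kind of sub-event ordering a single global query embedding cannot express.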

Dan Ben-Ami, Gabriele Serussi, Kobi Cohen, Chaim Baskin • 2026

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LongVideoBench (val)	Accuracy: 70.13	282
Video Understanding	Video-MME	Overall Score: 78.18	96
Video Understanding	HERBench-Lite	Accuracy: 43.22	18
Frame selection for long-form video QA	10-minute video (600 frames at 1 FPS), K=16	E2E Latency (s): 13.3	13
