Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

About

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, which makes locating the relevant frames analogous to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new state-of-the-art (SOTA) results in keyframe selection metrics on the manually annotated benchmark. Furthermore, when applied to downstream video question-answering tasks, the proposed approach achieves the largest performance gains among existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
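The iterative refinement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function `search_keyframes`, the `detect` callback, the parameters `budget`, `batch`, and `window`, and the doubling rule are all assumptions made for illustration. Only two of the four relations (spatial co-occurrence and temporal proximity) are modeled; attribute dependency and causal order are omitted.

```python
import random


def search_keyframes(num_frames, detect, target, cue,
                     budget=64, batch=8, window=5, seed=0):
    """Toy keyframe search that re-weights a frame sampling distribution.

    `detect(i)` is a hypothetical callback returning the set of object
    labels visible in frame i (a stand-in for a real detector).
    """
    rng = random.Random(seed)
    weights = [1.0] * num_frames      # start from uniform sampling
    scores = {}                       # visited frame index -> relevance score
    limit = min(budget, num_frames)

    while len(scores) < limit:
        # Sample unvisited frames in proportion to their current weights.
        candidates = [i for i in range(num_frames) if i not in scores]
        cand_weights = [weights[i] for i in candidates]
        k = min(batch, limit - len(scores))
        picked = set(rng.choices(candidates, weights=cand_weights, k=k))

        for i in picked:
            labels = detect(i)
            score = 0.0
            if target in labels:
                score += 1.0
                if cue in labels:
                    score += 1.0      # spatial co-occurrence: both in one frame
            scores[i] = score

            if target in labels or cue in labels:
                # Temporal proximity: make nearby frames more likely to be
                # sampled in later iterations.
                lo = max(0, i - window)
                hi = min(num_frames, i + window + 1)
                for j in range(lo, hi):
                    weights[j] *= 2.0

    # Highest score first; ties broken by frame index.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked, scores


def toy_detect(i):
    """Synthetic video: a table in frames 30-50, a cup in frames 40-44."""
    labels = set()
    if 30 <= i <= 50:
        labels.add("table")
    if 40 <= i <= 44:
        labels.add("cup")
    return labels


ranked, scores = search_keyframes(100, toy_detect, target="cup", cue="table",
                                  budget=32)
```

With a small `budget`, which frames get visited depends on the random seed, so the ranking is only as good as the sampled frames; the point of the sketch is that each detection of the target or cue shifts sampling probability toward its temporal neighborhood rather than keeping the distribution uniform.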

Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong • 2025

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LongVideoBench (val)	Accuracy 63.4	282
Video Question Answering	LongVideoBench	Accuracy 62.8	224
Video Understanding	LongVideoBench	Accuracy 70.23	128
Video Understanding	Video-MME	Overall Score 63	96
Video Question Answering	Video-MME without subtitles	Accuracy (Overall) 63	81
Video Question Answering	Video-MME Long	Accuracy 57.5	71
Video Question Answering	VideoMME (test)	Short Length Accuracy 74.35	61
Video Question Answering	VideoMME Medium	Accuracy 61.9	53
Video Question Answering	LongVideoBench (test)	Accuracy (Long) 52.66	42
Question Answering	Molmo2-Moment (M2M) v1 (test)	Accuracy 52.8	38

Showing 10 of 22 rows
