Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

About

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
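The iterative refinement described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`cooccurrence_score`, `refine_sampling`, `top_keyframes`), the additive update rule, the `spread` and `boost` parameters, and the dictionary of per-frame detections are all assumptions made for illustration. It models only one of the four relations (spatial co-occurrence) and shows how verified frames could shift sampling mass toward their temporal neighbours.

```python
import random


def cooccurrence_score(detections, frame, a, b):
    """Spatial co-occurrence: 1.0 if both labels are detected in the same frame."""
    labels = detections.get(frame, set())
    return 1.0 if a in labels and b in labels else 0.0


def refine_sampling(num_frames, score_frame, rounds=5, per_round=8,
                    spread=2, boost=1.0, seed=0):
    """Iteratively reweight a sampling distribution over frame indices.

    Starts uniform, samples a few frames per round, scores them, and adds
    weight around high-scoring frames (an assumed update rule).
    """
    rng = random.Random(seed)
    weights = [1.0] * num_frames          # unnormalized, uniform start
    scores = {}                           # frame index -> last observed score
    for _ in range(rounds):
        picked = rng.choices(range(num_frames), weights=weights, k=per_round)
        for idx in picked:
            s = score_frame(idx)
            scores[idx] = s
            lo, hi = max(0, idx - spread), min(num_frames, idx + spread + 1)
            for j in range(lo, hi):
                # Evidence decays with temporal distance from the scored frame.
                weights[j] += boost * s / (1 + abs(j - idx))
    total = sum(weights)
    return [w / total for w in weights], scores


def top_keyframes(scores, k):
    """Return the k best-scoring sampled frames in temporal order."""
    best = sorted(scores, key=scores.get, reverse=True)[:k]
    return sorted(best)


# Toy usage: a 100-frame video where "person" and "dog" co-occur in frames 40-44.
detections = {f: {"person", "dog"} for f in range(40, 45)}
weights, scores = refine_sampling(
    100, lambda f: cooccurrence_score(detections, f, "person", "dog"))
print(top_keyframes(scores, 3))
```

In a real system the scoring function would be backed by an object detector or vision-language model, and the other three relations (temporal proximity, attribute dependency, causal order) would each need their own verification rule.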

Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Long Video Understanding | LongVideoBench (val) | Accuracy: 63.4 | 225 |
| Video Question Answering | LongVideoBench | Accuracy: 62.8 | 210 |
| Video Understanding | LongVideoBench | -- | 123 |
| Video Understanding | Video-MME | Overall Score: 63 | 96 |
| Video Question Answering | VideoMME (test) | Short Length Accuracy: 74.35 | 61 |
| Video Question Answering | VideoMME Medium | Accuracy: 61.9 | 53 |
| Video Question Answering | LongVideoBench (test) | Accuracy (Long): 52.66 | 42 |
| Video Question Answering | Video-MME Long | Accuracy: 57.5 | 41 |
| Question Answering | Molmo2-Moment (M2M) v1 (test) | Accuracy: 52.8 | 38 |
| Long Video Question Answering | Video-MME | Accuracy: 67.2 | 30 |
Showing 10 of 21 rows
