
Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding

About

Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new SOTA performance on the manually annotated benchmark in key-frame selection metrics. Furthermore, when applied to downstream video question-answering tasks, the proposed approach demonstrates the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
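The iterative refinement described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `refine_keyframes` and `cooccurrence_score`, the neighbour-boost update rule, and the toy detection dictionary are all assumptions made for illustration. The sketch starts from a uniform sampling distribution over frames, scores the sampled frames with a single relation check (spatial co-occurrence of two object labels in the same frame), and shifts sampling weight toward frames near high-scoring ones.

```python
import random
from typing import Callable, Dict, List, Set, Tuple


def cooccurrence_score(
    detections: Dict[int, Set[str]], a: str, b: str
) -> Callable[[int], float]:
    """Hypothetical relation check: 1.0 if labels a and b share a frame."""
    def score(i: int) -> float:
        labels = detections.get(i, set())
        return 1.0 if a in labels and b in labels else 0.0
    return score


def refine_keyframes(
    num_frames: int,
    score_fn: Callable[[int], float],
    budget: int = 8,
    iterations: int = 4,
    spread: int = 2,
    seed: int = 0,
) -> Tuple[List[int], List[float]]:
    """Iteratively reweight a frame sampling distribution (illustrative only)."""
    rng = random.Random(seed)
    weights = [1.0] * num_frames      # start from a uniform distribution
    scores: Dict[int, float] = {}     # frame index -> score observed so far

    for _ in range(iterations):
        # Sample a small batch of frames according to the current weights.
        picked = rng.choices(range(num_frames), weights=weights, k=budget)
        for i in picked:
            s = score_fn(i)
            scores[i] = s
            # Assumed update rule: boost the frame and its temporal
            # neighbours, with the boost decaying with distance.
            lo, hi = max(0, i - spread), min(num_frames, i + spread + 1)
            for j in range(lo, hi):
                weights[j] += s / (1 + abs(i - j))

    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the highest-scoring visited frames, returned in temporal order.
    best = sorted(scores, key=scores.get, reverse=True)[:budget]
    return sorted(best), probs


# Toy usage: "person" and "dog" co-occur only in frames 40..59.
toy_detections = {i: {"person", "dog"} for i in range(40, 60)}
keyframes, probs = refine_keyframes(
    100, cooccurrence_score(toy_detections, "person", "dog")
)
```

Under these assumptions, weight can only accumulate around frames where the relation holds, so later rounds tend to sample that region more densely. The other three relations (temporal proximity, attribute dependency, causal order) would plug in as alternative `score_fn` implementations.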

Weiyu Guo, Ziyang Chen, Shaoguang Wang, Jianxiang He, Yijie Xu, Jinhui Ye, Ying Sun, Hui Xiong • 2025

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Long Video Understanding | LongVideoBench (val) | Accuracy: 63.4 | 210
Video Question Answering | LongVideoBench | Accuracy: 49.1 | 180
Video Understanding | Video-MME | Overall Score: 63 | 96
Video Understanding | LongVideoBench | -- | 92
Video Question Answering | Video-MME Long | Accuracy: 57.5 | 36
Video Question Answering | VideoMME Medium | Accuracy: 61.9 | 27
Video Question Answering | LONGVIDEOBENCH Medium | Accuracy: 52.2 | 24
Frame selection for long-form video QA | 10-minute video 600 frames at 1 FPS, K=16 | E2E Latency (s): 13.3 | 13
Long Video Understanding | Video-MME w/o sub (full) | Score (Long): 55.2 | 13
Keyframe Retrieval | LongVideoBench | Precision: 75.6 | 7
Showing 10 of 15 rows
