Logic-in-Frames: Dynamic Keyframe Search via Visual Semantic-Logical Verification for Long Video Understanding
About
Understanding long video content is a complex endeavor that often relies on densely sampled frame captions or end-to-end feature selectors, yet these techniques commonly overlook the logical relationships between textual queries and visual elements. In practice, computational constraints necessitate coarse frame subsampling, a challenge analogous to "finding a needle in a haystack." To address this issue, we introduce a semantics-driven search framework that reformulates keyframe selection under the paradigm of Visual Semantic-Logical Search. Specifically, we systematically define four fundamental logical dependencies: 1) spatial co-occurrence, 2) temporal proximity, 3) attribute dependency, and 4) causal order. These relations dynamically update frame sampling distributions through an iterative refinement process, enabling context-aware identification of semantically critical frames tailored to specific query requirements. Our method establishes new state-of-the-art (SOTA) performance in keyframe selection metrics on the manually annotated benchmark. Furthermore, when applied to downstream video question-answering tasks, the proposed approach achieves the best performance gains over existing methods on LongVideoBench and Video-MME, validating its effectiveness in bridging the logical gap between textual queries and visual-temporal reasoning. The code will be publicly available.
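The abstract does not specify how the relations update the sampling distribution, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a per-frame relevance score is available (here a toy spatial co-occurrence check over hypothetical detector labels) and uses temporal proximity to shift sampling weight toward neighbors of frames that scored well. The names `cooccurrence_score` and `search_keyframes`, and the parameters `rounds`, `batch`, `spread`, and `boost`, are all illustrative assumptions.

```python
import random


def cooccurrence_score(detected, required):
    """Toy spatial co-occurrence: fraction of required objects seen in one frame."""
    return len(required & detected) / len(required)


def search_keyframes(num_frames, score_fn, top_k=4, rounds=6, batch=8,
                     spread=3, boost=5.0, seed=0):
    """Iteratively resample frames, shifting probability mass toward the
    temporal neighborhood of frames that scored well (hypothetical scheme)."""
    rng = random.Random(seed)
    weights = [1.0] * num_frames   # uniform prior over frames
    scores = {}                    # frame index -> score, for visited frames

    for _ in range(rounds):
        if sum(weights) <= 0:      # every frame has already been visited
            break
        # Draw a batch of frames in proportion to the current weights.
        picked = set(rng.choices(range(num_frames), weights=weights, k=batch))
        for i in picked:
            scores[i] = score_fn(i)
            # Temporal proximity: raise weights near a high-scoring frame.
            lo, hi = max(0, i - spread), min(num_frames, i + spread + 1)
            for j in range(lo, hi):
                weights[j] += boost * scores[i] / (1 + abs(i - j))
        for i in scores:           # never resample a visited frame
            weights[i] = 0.0

    # Rank visited frames by score, breaking ties by frame index.
    ranked = sorted(scores, key=lambda i: (-scores[i], i))
    return ranked[:top_k], scores


# Toy detections (assumed data): frames 50-59 contain both queried objects.
detections = {i: {"man", "dog"} if 50 <= i < 60 else {"man"} for i in range(200)}
query = {"man", "dog"}
frames, scores = search_keyframes(
    200, lambda i: cooccurrence_score(detections[i], query))
print(frames)
```

In this sketch only the visited frames are ever scored, which reflects the compute constraint the abstract describes: the budget is `rounds * batch` scoring calls rather than one per frame. Attribute dependency and causal order would need richer scoring functions and are not modeled here.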
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Long Video Understanding | LongVideoBench (val) | Accuracy | 63.4 | 225 |
| Video Question Answering | LongVideoBench | Accuracy | 62.8 | 210 |
| Video Understanding | LongVideoBench | -- | -- | 123 |
| Video Understanding | Video-MME | Overall Score | 63 | 96 |
| Video Question Answering | VideoMME (test) | Short Length Accuracy | 74.35 | 61 |
| Video Question Answering | VideoMME Medium | Accuracy | 61.9 | 53 |
| Video Question Answering | LongVideoBench (test) | Accuracy (Long) | 52.66 | 42 |
| Video Question Answering | Video-MME Long | Accuracy | 57.5 | 41 |
| Question Answering | Molmo2-Moment (M2M) v1 (test) | Accuracy | 52.8 | 38 |
| Long Video Question Answering | Video-MME | Accuracy | 67.2 | 30 |