Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding
About
Frame selection is crucial when applying Large Vision-Language Models (LVLMs) to long videos, because frames are highly redundant and context windows are limited. Current methods typically select the frames most relevant to a given query, which yields a disjointed set of frames that disregards the narrative structure of the video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts, the pivotal moments of narrative change that are essential to comprehending the holistic storyline of a video. However, directly detecting abrupt changes in the query-frame similarity signal is often unreliable because of high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, whose multi-resolution analysis in both the time and frequency domains is well suited to the problem. Applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this segmentation, WFS-SB follows a two-stage strategy: first, it adaptively allocates a frame budget to each clip based on a composite importance score; second, within each clip, it employs Maximal Marginal Relevance to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods. Our code is available at https://github.com/MAC-AutoML/WFS-SB.
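The pipeline described above (coarse-scale smoothing, boundary detection, budget allocation, per-clip Maximal Marginal Relevance) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation; see the linked repository for that. Everything here is an assumption made for illustration: the function names, the choice of a Haar wavelet, the use of the first difference of the coarse approximation as the "change signal", the `min_jump` threshold, proportional budget allocation over externally supplied clip scores, and the `lam` trade-off weight.

```python
import numpy as np

def haar_approx(signal, levels):
    """Coarse Haar approximation: `levels` rounds of pairwise averaging."""
    x = np.asarray(signal, dtype=float)
    for _ in range(levels):
        if len(x) % 2:                      # pad odd lengths with the last sample
            x = np.append(x, x[-1])
        x = 0.5 * (x[0::2] + x[1::2])       # low-pass step, halves the length
    return x

def semantic_boundaries(similarity, levels=2, min_jump=0.1):
    """Frame indices where the coarse signal changes most sharply.

    Assumption: the change signal is |first difference| of the coarse
    approximation, and boundaries are its local maxima above `min_jump`.
    """
    change = np.abs(np.diff(haar_approx(similarity, levels)))
    step = 2 ** levels                      # frames per coarse sample
    bounds = []
    for i, c in enumerate(change):
        left = change[i - 1] if i > 0 else -np.inf
        right = change[i + 1] if i + 1 < len(change) else -np.inf
        if c >= min_jump and c > left and c >= right:
            bounds.append((i + 1) * step)   # map back to frame index
    return bounds

def allocate_budget(scores, budget):
    """Split `budget` frames across clips in proportion to positive scores."""
    w = np.asarray(scores, dtype=float)
    raw = budget * w / w.sum()
    alloc = np.floor(raw).astype(int)
    leftover = int(budget - alloc.sum())
    # Hand leftover frames to the clips with the largest fractional parts.
    for i in np.argsort(raw - alloc)[::-1][:leftover]:
        alloc[i] += 1
    return alloc.tolist()

def mmr_select(relevance, features, k, lam=0.7):
    """Greedy Maximal Marginal Relevance inside one clip."""
    if k <= 0:
        return []
    rel = np.asarray(relevance, dtype=float)
    feats = np.asarray(features, dtype=float)
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    sim = feats @ feats.T                   # cosine similarity between frames
    chosen = [int(np.argmax(rel))]          # start from the most relevant frame
    while len(chosen) < min(k, len(rel)):
        redundancy = sim[:, chosen].max(axis=1)
        score = lam * rel - (1 - lam) * redundancy
        score[chosen] = -np.inf             # never pick a frame twice
        chosen.append(int(np.argmax(score)))
    return sorted(chosen)
```

In this sketch, a caller would compute per-frame query similarity, split the video at `semantic_boundaries(...)`, score each clip, call `allocate_budget(...)`, and then run `mmr_select(...)` on each clip with its allocated `k`. Pairwise averaging cancels frame-to-frame jitter, which is why the boundary search runs on the coarse signal rather than on the raw similarity.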
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | LongVideoBench | Accuracy: 64.9 | 210 |
| Long Video Understanding | MLVU | -- | 205 |
| Video Understanding | LongVideoBench | -- | 56 |
| Question Answering | Molmo2-Moment (M2M) v1 (test) | Accuracy: 59 | 38 |
| Long Video Question Answering | Video-MME | Accuracy: 72.6 | 30 |
| Video Understanding | VideoMME | Accuracy (Base): 65.6 | 22 |
| Video Understanding | MLVU | Base Accuracy: 68.4 | 18 |
| Caption Retrieval | M2M | HIT@1: 11.4 | 8 |
| Question Retrieval | Molmo2-Moment (test) | HIT@1: 16.9 | 8 |
| Long Video Understanding | LongVideoBench | Base Score: 51.7 | 3 |