
Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

About

Frame selection is crucial due to high frame redundancy and limited context windows when applying Large Vision-Language Models (LVLMs) to long videos. Current methods typically select frames with high relevance to a given query, resulting in a disjointed set of frames that disregards the narrative structure of the video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts - pivotal moments of narrative change that are essential to comprehending the overall storyline of the video. However, direct detection of abrupt changes in the query-frame similarity signal is often unreliable due to high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, whose multi-resolution analysis in both the time and frequency domains is well suited to this problem. By applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB uses a two-stage strategy: first, it adaptively allocates a frame budget to each clip based on a composite importance score; second, within each clip, it employs the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods. Our code is available at https://github.com/MAC-AutoML/WFS-SB.
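
The pipeline described in the abstract can be illustrated with a small, dependency-free Python sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal illustration under several assumptions. The function names (`haar_approx`, `semantic_boundaries`, `allocate_budget`, `mmr_select`) are hypothetical, the Haar wavelet is assumed because the abstract does not name a wavelet family, the "semantic change signal" is approximated by differences between adjacent coarsest-scale coefficients, the per-clip importance scores are taken as given inputs, and the MMR trade-off `lam` is an arbitrary choice.

```python
import math

def haar_approx(signal, levels):
    """Haar approximation coefficients (pairwise means) after up to `levels` steps.
    Returns the coefficients and how many frames each coefficient covers."""
    approx, applied = list(signal), 0
    for _ in range(levels):
        if len(approx) < 4:
            break
        if len(approx) % 2:              # pad odd lengths by repeating the last sample
            approx.append(approx[-1])
        approx = [(approx[i] + approx[i + 1]) / 2 for i in range(0, len(approx), 2)]
        applied += 1
    return approx, 2 ** applied

def semantic_boundaries(similarity, levels=2):
    """Frame indices where the coarse-scale change signal has a strict local extremum."""
    coarse, stride = haar_approx(similarity, levels)
    # Assumed change signal: difference between neighbouring coarse coefficients.
    change = [coarse[k + 1] - coarse[k] for k in range(len(coarse) - 1)]
    bounds = []
    for k in range(1, len(change) - 1):
        is_max = change[k] > change[k - 1] and change[k] > change[k + 1]
        is_min = change[k] < change[k - 1] and change[k] < change[k + 1]
        if is_max or is_min:
            bounds.append((k + 1) * stride)   # first frame of the block after the change
    return bounds

def allocate_budget(scores, budget):
    """Split `budget` frames across clips in proportion to (assumed) importance scores."""
    total = sum(scores)
    raw = [budget * s / total for s in scores]
    alloc = [int(r) for r in raw]
    # Give leftover frames to the clips with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[: budget - sum(alloc)]:
        alloc[i] += 1
    return alloc

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mmr_select(relevance, features, k, lam=0.7):
    """Greedy Maximal Marginal Relevance: trade query relevance against redundancy."""
    selected, candidates = [], list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i):
            redundancy = max((cosine(features[i], features[j]) for j in selected), default=0.0)
            return lam * relevance[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)

# Toy query-frame similarity: alternating noise around 0.25, then around 0.75.
similarity = [0.125, 0.375] * 4 + [0.625, 0.875] * 4
boundaries = semantic_boundaries(similarity, levels=2)
budgets = allocate_budget([1.0, 3.0], budget=8)
picked = mmr_select([0.9, 0.85, 0.5], [[1, 0], [1, 0], [0, 1]], k=2, lam=0.5)
```

In this toy run the frame-to-frame jitter averages out at the coarse scale, so only the shift at frame 8 is reported as a boundary, and MMR skips the second frame because it duplicates the first. A fuller version would also cap each clip's allocation at the clip length, which this sketch does not do.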

Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, Xiawu Zheng • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Question Answering | LongVideoBench | Accuracy 64.9 | 210 |
| Long Video Understanding | MLVU | -- | 205 |
| Video Understanding | LongVideoBench | -- | 56 |
| Question Answering | Molmo2-Moment (M2M) v1 (test) | Accuracy 59 | 38 |
| Long Video Question Answering | Video-MME | Accuracy 72.6 | 30 |
| Video Understanding | VideoMME | Accuracy (Base) 65.6 | 22 |
| Video Understanding | MLVU | Base Accuracy 68.4 | 18 |
| Caption Retrieval | M2M | HIT@1 11.4 | 8 |
| Question Retrieval | Molmo2-Moment (test) | HIT@1 16.9 | 8 |
| Long Video Understanding | LongVideoBench | Base Score 51.7 | 3 |