Wavelet-based Frame Selection by Detecting Semantic Boundary for Long Video Understanding

About

Frame selection is crucial when applying Large Vision-Language Models (LVLMs) to long videos, because frames are highly redundant and context windows are limited. Current methods typically select frames with high relevance to a given query, which yields a disjointed set of frames that disregards the narrative structure of the video. In this paper, we introduce Wavelet-based Frame Selection by Detecting Semantic Boundary (WFS-SB), a training-free framework that presents a new perspective: effective video understanding hinges not only on high relevance but, more importantly, on capturing semantic shifts, the pivotal moments of narrative change that are essential to comprehending the overall storyline of the video. However, directly detecting abrupt changes in the query-frame similarity signal is often unreliable because of high-frequency noise arising from model uncertainty and transient visual variations. To address this, we leverage the wavelet transform, whose multi-resolution analysis in both the time and frequency domains is well suited to the problem. Applying this transform, we decompose the noisy signal into multiple scales and extract a clean semantic change signal from the coarsest scale. We identify the local extrema of this signal as semantic boundaries, which segment the video into coherent clips. Building on this, WFS-SB comprises a two-stage strategy: first, it adaptively allocates a frame budget to each clip based on a composite importance score; second, within each clip, it employs the Maximal Marginal Relevance approach to select a diverse yet relevant set of frames. Extensive experiments show that WFS-SB significantly boosts LVLM performance, e.g., improving accuracy by 5.5% on VideoMME, 9.5% on MLVU, and 6.2% on LongVideoBench, consistently outperforming state-of-the-art methods. Our code is available at https://github.com/MAC-AutoML/WFS-SB.
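The pipeline described in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation (see the linked repository for that), and several choices here are assumptions made only for illustration: a decimated Haar wavelet, the coarsest-level detail coefficients standing in for the "semantic change signal", a budget split proportional to a caller-supplied clip score, and cosine similarity between hypothetical frame feature vectors inside Maximal Marginal Relevance. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def coarsest_detail(signal: Sequence[float], levels: int) -> List[float]:
    """Haar detail coefficients at the coarsest of `levels` scales.

    Uses averages and half-differences instead of the 1/sqrt(2) normalisation;
    this only rescales the values and does not move their extrema.
    """
    approx, detail = list(signal), []
    for _ in range(levels):
        if len(approx) < 2:
            break
        if len(approx) % 2:  # pad odd lengths by repeating the last sample
            approx.append(approx[-1])
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        detail = [(b - a) / 2.0 for a, b in pairs]
        approx = [(a + b) / 2.0 for a, b in pairs]
    return detail


def local_extrema(values: Sequence[float]) -> List[int]:
    """Indices that are strict local maxima or minima (endpoints excluded)."""
    return [
        i for i in range(1, len(values) - 1)
        if (values[i] > values[i - 1] and values[i] > values[i + 1])
        or (values[i] < values[i - 1] and values[i] < values[i + 1])
    ]


def semantic_boundaries(similarity: Sequence[float], levels: int = 2) -> List[int]:
    """Map extrema of the coarse change signal back to frame indices."""
    step = 2 ** levels  # each coarse coefficient spans `step` frames
    detail = coarsest_detail(similarity, levels)
    frames = (k * step + step // 2 for k in local_extrema(detail))
    return [f for f in frames if f < len(similarity)]


def allocate_budget(scores: Sequence[float], budget: int) -> List[int]:
    """Split `budget` frames across clips in proportion to positive scores
    (largest-remainder rounding; clip lengths are not capped here)."""
    total = sum(scores)
    raw = [budget * s / total for s in scores]
    alloc = [int(r) for r in raw]
    by_remainder = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in by_remainder[: budget - sum(alloc)]:
        alloc[i] += 1
    return alloc


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(relevance: Sequence[float], features: Sequence[Sequence[float]],
               k: int, lam: float = 0.7) -> List[int]:
    """Greedy Maximal Marginal Relevance: trade relevance against similarity
    to frames already chosen. Returns sorted frame indices within the clip."""
    selected: List[int] = []
    candidates = list(range(len(relevance)))
    while candidates and len(selected) < k:
        def score(i: int) -> float:
            redundancy = max((cosine(features[i], features[j]) for j in selected),
                             default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return sorted(selected)
```

For a toy similarity signal that jumps at frames 6 and 14, `semantic_boundaries([0.1] * 6 + [0.9] * 8 + [0.1] * 6, levels=2)` returns `[6, 14]`. A decimated Haar transform is sensitive to how a jump aligns with its blocks, so a practical system might prefer a different wavelet or an undecimated transform; the abstract does not say which is used.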

Wang Chen, Yuhui Zeng, Yongdong Luo, Tianyu Xie, Luojun Lin, Jiayi Ji, Yan Zhang, Xiawu Zheng • 2026

Related benchmarks

Task	Dataset	Result
Long Video Understanding	MLVU	--	265
Video Question Answering	LongVideoBench	Accuracy 64.9	224
Video Understanding	LongVideoBench	--	59
Question Answering	Molmo2-Moment (M2M) v1 (test)	Accuracy 59	38
Long Video Question Answering	Video-MME	Accuracy 72.6	30
Video Understanding	VideoMME	Accuracy (Base) 65.6	22
Video Understanding	MLVU	Base Accuracy 68.4	18
Caption Retrieval	M2M	HIT@1 11.4	8
Question Retrieval	Molmo2-Moment (test)	HIT@1 16.9	8
Long Video Understanding	LongVideoBench	Base Score 51.7	3
