FLoC: Facility Location-Based Efficient Visual Token Compression for Long Video Understanding

About

Recent studies in long video understanding have harnessed the advanced visual-language reasoning capabilities of Large Multimodal Models (LMMs), driving the evolution of video-LMMs specialized for processing extended video sequences. However, the scalability of these models is severely limited by the overwhelming volume of visual tokens generated from such sequences. To address this challenge, we propose FLoC, an efficient visual token compression framework based on the facility location function, a principled approach that selects a compact yet highly representative and diverse subset of visual tokens within a predefined token budget. By integrating the lazy greedy algorithm, our method performs this selection quickly, drastically reducing the number of visual tokens while guaranteeing near-optimal performance. Notably, our approach is training-free, model-agnostic, and query-agnostic, providing a versatile solution that integrates with diverse video-LMMs and existing workflows. Extensive evaluations on large-scale benchmarks, such as Video-MME, MLVU, LongVideoBench, and EgoSchema, show that our framework consistently surpasses recent compression techniques, highlighting its effectiveness, robustness, and processing efficiency in long video understanding.

Janghoon Cho, Jungsoo Lee, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, Sungha Choi • 2025
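
The abstract's selection step can be illustrated with a small sketch. It is not the authors' implementation. It assumes the facility location objective takes the common form f(S) = sum over all tokens i of the maximum similarity between i and any selected token j in S, and it assumes cosine similarity rescaled to [0, 1]; the function name `facility_location_lazy_greedy` and its parameters are hypothetical. Because this objective is monotone and submodular, plain greedy selection under a cardinality budget is known to reach at least a (1 - 1/e) fraction of the optimum, and the lazy variant makes the same choices (up to ties) while re-evaluating far fewer candidates.

```python
import heapq

import numpy as np


def facility_location_lazy_greedy(tokens: np.ndarray, budget: int) -> list[int]:
    """Pick up to `budget` row indices of `tokens` that greedily maximize
    f(S) = sum_i max_{j in S} sim(i, j).  Illustrative sketch only."""
    # Assumption: cosine similarity rescaled to [0, 1] so that f is
    # non-negative and monotone; the paper may use a different similarity.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / np.maximum(norms, 1e-12)
    sim = (normed @ normed.T + 1.0) / 2.0

    n = sim.shape[0]
    budget = min(budget, n)

    # coverage[i] = best similarity between token i and any selected token.
    coverage = np.zeros(n)

    # Max-heap (via negated keys) of upper bounds on each marginal gain.
    # With nothing selected, the gain of candidate j is exactly f({j}).
    heap = [(-float(sim[:, j].sum()), j) for j in range(n)]
    heapq.heapify(heap)

    selected: list[int] = []
    while len(selected) < budget:
        _, j = heapq.heappop(heap)
        # Refresh the marginal gain of j against the current selection.
        gain = float(np.maximum(sim[:, j] - coverage, 0.0).sum())
        # Submodularity means stale heap keys never underestimate a gain,
        # so if the refreshed gain still beats the best remaining bound,
        # j is the true greedy choice and nothing else needs re-evaluation.
        if not heap or gain >= -heap[0][0]:
            selected.append(j)
            coverage = np.maximum(coverage, sim[:, j])
        else:
            heapq.heappush(heap, (-gain, j))
    return selected
```

For example, calling `facility_location_lazy_greedy(features, budget=64)` on a hypothetical `(num_tokens, dim)` array of visual token features would return 64 indices to keep. The sketch builds the full pairwise similarity matrix, which is quadratic in the number of tokens, so a practical system would likely apply it per clip or per segment rather than to an entire long video at once.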

Related benchmarks

Task	Dataset	Result
3D Question Answering	ScanQA (val)	CIDEr 99.4	391
Long Video Understanding	MLVU	--	265
Video Understanding	MLVU	Score 71.57	233
3D Question Answering	SQA3D (test)	EM@1 57.2	197
Video Understanding	EgoSchema	EgoSchema Score 69.4	185
Video Understanding	LVB	Accuracy 70.63	101
Video Understanding	Video-MME	Overall Score 61.04	96
Video Understanding	Video-MME	Overall Score 64.93	92
3D Question Answering	VSI-Bench	Average Score 36.6	88
Video Understanding	Video-MME v1.0 (test)	Score (Short) 72	56

Showing 10 of 16 rows
