LLaVA-Scissor: Token Compression with Semantic Connected Components for Video LLMs

About

In this paper, we present LLaVA-Scissor, a training-free token compression strategy designed for video multimodal large language models. Previous methods mostly compress tokens based on attention scores, but they fail to capture all semantic regions effectively and often leave redundant tokens. In contrast, we propose to leverage the Semantic Connected Components (SCC) approach, which assigns tokens to distinct semantic regions within the token set, ensuring comprehensive semantic coverage. The outcome is a two-step spatio-temporal token compression strategy that applies SCC in both the spatial and temporal domains. This strategy compresses tokens effectively by representing the entire video with a set of non-overlapping semantic tokens. We conduct extensive evaluations of the token compression capabilities of LLaVA-Scissor across diverse video understanding benchmarks, including video question answering, long video understanding, and comprehensive multiple-choice benchmarks. Experimental results show that LLaVA-Scissor outperforms other token compression methods across these benchmarks, particularly at low token retention ratios. Project page: https://github.com/HumanMLLM/LLaVA-Scissor.
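The two-step idea described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the abstract does not specify how tokens are linked or merged, so the cosine-similarity threshold `tau`, the union-find grouping, the mean-pooling of each component, and all function names (`connected_components`, `compress`, `compress_video`) are assumptions made here for illustration.

```python
import math
from typing import Dict, List, Sequence

Token = Sequence[float]


def cosine(a: Token, b: Token) -> float:
    """Cosine similarity between two token vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0 or nb == 0 else dot / (na * nb)


def connected_components(tokens: Sequence[Token], tau: float) -> List[List[int]]:
    """Group token indices into connected components of the similarity graph.

    Assumption: two tokens share an edge when their cosine similarity >= tau.
    """
    parent = list(range(len(tokens)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            if cosine(tokens[i], tokens[j]) >= tau:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups: Dict[int, List[int]] = {}
    for i in range(len(tokens)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def compress(tokens: Sequence[Token], tau: float) -> List[List[float]]:
    """Replace each component by the mean of its tokens (assumed merge rule)."""
    merged = []
    for comp in connected_components(tokens, tau):
        dim = len(tokens[comp[0]])
        merged.append([sum(tokens[i][d] for i in comp) / len(comp) for d in range(dim)])
    return merged


def compress_video(frames: Sequence[Sequence[Token]], tau: float) -> List[List[float]]:
    """Two-step compression: within each frame (spatial), then across frames (temporal)."""
    spatial = [tok for frame in frames for tok in compress(frame, tau)]
    return compress(spatial, tau)


# Toy example: two frames of 2-D "tokens" covering two semantic regions.
frames = [
    [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
video_tokens = compress_video(frames, tau=0.9)
print(len(video_tokens))
```

In this toy input, five tokens collapse to one representative per semantic region. Because every token belongs to exactly one component, the retained set is non-overlapping and no region is dropped, which is the coverage property the abstract contrasts with attention-score selection.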

Boyuan Sun, Jiaxing Zhao, Xihan Wei, Qibin Hou • 2025

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Video Understanding	VideoMME	--	--	369
Video Question Answering	VideoMME	Accuracy	57.4	254
Video Understanding	MLVU	Score	67.76	233
Video Understanding	EgoSchema	EgoSchema Score	65.8	185
Video Question Answering	NEXT-QA	Overall Accuracy	81.2	122
Video Understanding	LVB	Accuracy	55.12	101
Video Understanding	Video-MME	Overall Score	61.15	92
Open-ended Video Question Answering	ActNet-QA	Accuracy	47.89	18
Forgery Detection	FakeVLM and FakeShield Forgery Detection Suites	Accuracy (FakeClue)	97.22	16

