Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

About

Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates hierarchical frame selection with a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, which outperforms both fine-tuned and training-free methods and sets a new state of the art for zero-shot video understanding.
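
To make the "bipartite token merging" idea concrete, here is a minimal sketch of a generic bipartite matching-and-averaging step over a token sequence. It is an illustration under stated assumptions, not DYTO's actual procedure: the function name `bipartite_merge`, the alternating even/odd split, the cosine similarity measure, and the fixed merge budget `r` are all hypothetical choices made for this example, and the abstract does not specify how DYTO sets the merge budget dynamically.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bipartite_merge(tokens, r):
    """Merge up to r tokens into their most similar partner (illustrative only).

    Assumption: tokens at even positions form set A and tokens at odd
    positions form set B. Each A token proposes its most similar B token;
    the r strongest proposals are merged by averaging. Relative order of
    the surviving tokens is preserved.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]

    # One candidate edge per A token: (similarity, A index, best B index).
    edges = []
    for i in a_idx:
        best_j = max(b_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        edges.append((cosine(tokens[i], tokens[best_j]), i, best_j))

    # Keep only the r most similar pairs.
    edges.sort(key=lambda e: e[0], reverse=True)
    kept_edges = edges[:r]
    merged_a = {i for _, i, _ in kept_edges}

    # Accumulate sums and counts so each B token becomes the mean of
    # itself and every A token merged into it.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in kept_edges:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    merged = []
    for k in range(len(tokens)):
        if k in merged_a:
            continue  # this A token was absorbed into a B token
        if k in sums:
            merged.append([s / counts[k] for s in sums[k]])
        else:
            merged.append(list(tokens[k]))
    return merged
```

Calling `bipartite_merge(tokens, r)` on a list of `n` token vectors returns at most `n` and at least `n - r` vectors. A dynamic scheme in the spirit of the abstract would presumably choose `r` per frame cluster rather than fixing it, but that policy is not described here and is left out of the sketch.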

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zenghui Ding, Xianjun Yang, Yining Sun • 2024

Related benchmarks

Task	Dataset	Result
Video Question Answering	EgoSchema	Accuracy: 56.8	161
Video Question Answering	NExT-QA Multi-choice	Accuracy: 72.9	114
Video Question Answering	NextQA	Accuracy: 72.9	78
Multiple-choice Video Question Answering	EgoSchema	Accuracy: 56.8	61
Video-based Question Answering	STAR	Accuracy: 51.1	50
Video Question Answering	Video-MME	Accuracy (Average, w/o Subtitle): 53.4	48
Multiple Choice VideoQA	IntentQA	Accuracy: 66.4	41
Multi-choice Video Question Answering	VideoMME	--	21

