Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding

About

Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates hierarchical frame selection with a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, which achieves superior performance compared to both fine-tuned and training-free methods and sets a new state of the art for zero-shot video understanding.
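
To make the idea of bipartite token merging concrete, here is a minimal sketch of one common generic scheme: split the tokens into two alternating sets, match each token in the first set to its most similar token in the second set by cosine similarity, and average the r most similar pairs. This is an illustration under stated assumptions, not DYTO's implementation: the function names `cosine` and `bipartite_merge`, the alternating split, and the plain averaging are all hypothetical choices, and the abstract does not specify how DYTO picks the number of merges or combines this step with frame selection.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def bipartite_merge(tokens, r):
    """Merge up to r tokens from set A (even indices) into set B (odd indices).

    Hypothetical helper for illustration only. Returns a new token list,
    in original index order, with at most r fewer tokens.
    """
    a_idx = list(range(0, len(tokens), 2))
    b_idx = list(range(1, len(tokens), 2))
    if not b_idx or r <= 0:
        return [list(t) for t in tokens]

    # For every A token, find its most similar B token (one edge per A token).
    edges = []
    for i in a_idx:
        best_j = max(b_idx, key=lambda j: cosine(tokens[i], tokens[j]))
        edges.append((cosine(tokens[i], tokens[best_j]), i, best_j))

    # Keep only the r most similar edges; those A tokens get merged away.
    edges.sort(key=lambda e: e[0], reverse=True)
    kept_edges = edges[:r]
    merged_sources = {i for _, i, _ in kept_edges}

    # Accumulate sums and counts so each B token becomes the mean of its group.
    sums = {j: list(tokens[j]) for j in b_idx}
    counts = {j: 1 for j in b_idx}
    for _, i, j in kept_edges:
        sums[j] = [s + x for s, x in zip(sums[j], tokens[i])]
        counts[j] += 1

    merged = []
    for k in range(len(tokens)):
        if k in sums:                      # B token, possibly averaged with A tokens
            merged.append([s / counts[k] for s in sums[k]])
        elif k not in merged_sources:      # A token that was not merged
            merged.append(list(tokens[k]))
    return merged


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(bipartite_merge(toy_tokens, r=1))
```

In this toy run the first two tokens are identical, so they form the most similar pair and are merged, leaving three tokens. A dynamic scheme would choose r per frame or per cluster rather than fixing it in advance.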

Yiming Zhang, Zhuokai Zhao, Zhaorun Chen, Zenghui Ding, Xianjun Yang, Yining Sun • 2024

Related benchmarks

Task | Dataset | Result | Rank
Video Question Answering | NExT-QA Multi-choice | Accuracy 72.9 | 102
Multiple-choice Video Question Answering | EgoSchema | Accuracy 56.8 | 61
Video-based Question Answering | STAR | Accuracy 51.1 | 46
Multiple Choice VideoQA | IntentQA | Accuracy 66.4 | 41
Multi-choice Video Question Answering | VideoMME | Accuracy 53.4 | 13
