Beyond Training: Dynamic Token Merging for Zero-Shot Video Understanding
About
Recent advancements in multimodal large language models (MLLMs) have opened new avenues for video understanding. However, achieving high fidelity in zero-shot video tasks remains challenging. Traditional video processing methods rely heavily on fine-tuning to capture nuanced spatial-temporal details, which incurs significant data and computation costs. In contrast, training-free approaches, though efficient, often lack robustness in preserving context-rich features across complex video content. To this end, we propose DYTO, a novel dynamic token merging framework for zero-shot video understanding that adaptively optimizes token efficiency while preserving crucial scene details. DYTO integrates hierarchical frame selection with a bipartite token merging strategy to dynamically cluster key frames and selectively compress token sequences, striking a balance between computational efficiency and semantic richness. Extensive experiments across multiple benchmarks demonstrate the effectiveness of DYTO, which outperforms both fine-tuned and training-free methods and sets a new state of the art for zero-shot video understanding.
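To make the idea of bipartite token merging concrete, here is a minimal sketch in plain Python. It is not DYTO's implementation: the abstract does not give the algorithm's details, so the alternating split, the cosine-similarity matching, the averaging rule, and the names `cosine` and `bipartite_merge` are all illustrative assumptions modeled on generic bipartite soft matching.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0


def bipartite_merge(tokens, r):
    """Reduce a token list by up to r tokens (hypothetical helper, not DYTO's API).

    tokens: list of feature vectors (lists of floats).
    r: number of tokens to merge away (assumed hyperparameter).
    """
    # Split tokens into two disjoint sets by alternating position.
    a, b = tokens[0::2], tokens[1::2]
    r = max(0, min(r, len(a)))
    if r == 0 or not b:
        return [list(t) for t in tokens]

    # For every token in set A, find its most similar token in set B.
    matches = []
    for i, u in enumerate(a):
        sims = [cosine(u, v) for v in b]
        j = max(range(len(b)), key=sims.__getitem__)
        matches.append((sims[j], i, j))

    # Merge only the r most similar pairs; the remaining A tokens are kept.
    matches.sort(key=lambda m: m[0], reverse=True)
    merged = matches[:r]
    kept = sorted(i for _, i, _ in matches[r:])

    # Average each B token with the A tokens merged into it.
    sums = [list(v) for v in b]
    counts = [1] * len(b)
    for _, i, j in merged:
        sums[j] = [s + x for s, x in zip(sums[j], a[i])]
        counts[j] += 1
    b_out = [[s / c for s in vec] for vec, c in zip(sums, counts)]

    # Note: output order is kept-A tokens followed by B tokens,
    # not the original sequence order.
    return [list(a[i]) for i in kept] + b_out


if __name__ == "__main__":
    frame_tokens = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0],
                    [0.1, 1.0], [-1.0, 0.0], [0.0, -1.0]]
    reduced = bipartite_merge(frame_tokens, r=2)
    print(len(frame_tokens), "->", len(reduced))
```

In this sketch, `r` fixes how many tokens are removed per call. A dynamic scheme such as the one the abstract describes would presumably choose the amount of merging per frame or per cluster, which is not modeled here.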
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | NExT-QA Multi-choice | Accuracy | 72.9 | 102 |
| Multiple-choice Video Question Answering | EgoSchema | Accuracy | 56.8 | 61 |
| Video-based Question Answering | STAR | Accuracy | 51.1 | 46 |
| Multiple Choice VideoQA | IntentQA | Accuracy | 66.4 | 41 |
| Multi-choice Video Question Answering | VideoMME | Accuracy | 53.4 | 13 |