Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
About
The application of Large Vision-Language Models (LVLMs) to image and video analysis is an exciting and rapidly evolving field. Recent years have seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but comparable datasets for video remain scarce. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not handle the complexities of longer videos efficiently. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to cover a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at https://github.com/Hon-Wong/ByteVideoLLM
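To illustrate the idea behind dynamic visual token compression, here is a minimal sketch in Python/NumPy. It is a hypothetical implementation, not the paper's actual operator: it assumes per-frame visual tokens arranged in a spatial grid and picks the smallest average-pooling stride per video so the total token count fits a fixed budget, so shorter videos keep more tokens per frame and longer videos are compressed more aggressively. The function name `compress_visual_tokens` and the `token_budget` parameter are illustrative.

```python
import math
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, token_budget: int = 2048) -> np.ndarray:
    """Adaptively pool per-frame visual tokens so the total stays under a budget.

    tokens: array of shape (T, H, W, D) -- T frames, each an HxW grid of D-dim tokens.
    Returns an array of shape (T, H'*W', D) with T * H' * W' <= token_budget.

    Hypothetical sketch of dynamic token compression; the architecture in the
    paper may use a different compression operator.
    """
    T, H, W, D = tokens.shape
    per_frame = max(1, token_budget // T)  # token allowance per frame

    # Smallest pooling stride s such that the pooled grid fits the allowance.
    s = 1
    while math.ceil(H / s) * math.ceil(W / s) > per_frame:
        s += 1

    Hp, Wp = math.ceil(H / s), math.ceil(W / s)
    out = np.zeros((T, Hp, Wp, D), dtype=tokens.dtype)
    for i in range(Hp):
        for j in range(Wp):
            # Average-pool each s x s patch (edge patches may be smaller).
            patch = tokens[:, i * s:(i + 1) * s, j * s:(j + 1) * s, :]
            out[:, i, j, :] = patch.mean(axis=(1, 2))
    return out.reshape(T, Hp * Wp, D)
```

With a 2048-token budget, a 32-frame video with 24x24 token grids is pooled to 64 tokens per frame, while a 4-frame clip keeps 144 tokens per frame, capturing the length-dependent trade-off between temporal coverage and per-frame detail.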
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long Video Understanding | LongVideoBench | Score | 70.1 | 248 |
| Long Video Understanding | MLVU | Score | 70.1 | 154 |
| Long Video Question Answering | MLVU | M-Avg | 70.1 | 39 |
| Long Video Understanding | Video-MME | Overall Score | 64.6 | 30 |
| Long Video Question Answering | Video-MME w/o subtitles | Accuracy | 0.646 | 14 |
| VideoQA | Video-MME | VQA Accuracy (Overall) | 64.6 | 13 |
| VideoQA | MLVU | Mean Score | 70.1 | 12 |