Dynamic-VLM: Simple Dynamic Visual Token Compression for VideoLLM
About
The application of Large Vision-Language Models (LVLMs) to image and video analysis is an exciting and rapidly evolving field. Recent years have seen significant growth in high-quality image-text datasets for fine-tuning image understanding, but comparable datasets for video remain scarce. Additionally, many VideoLLMs are extensions of single-image VLMs, which may not handle the complexities of longer videos efficiently. In this study, we introduce a large-scale synthetic dataset created from proprietary models, using carefully designed prompts to cover a wide range of questions. We also explore a dynamic visual token compression architecture that strikes a balance between computational efficiency and performance. Our proposed Dynamic-VLM achieves state-of-the-art results across various video tasks and shows impressive generalization, setting new baselines in multi-image understanding. Notably, Dynamic-VLM delivers an absolute improvement of 2.7% over LLaVA-OneVision on VideoMME and 10.7% on MuirBench. Code is available at https://github.com/Hon-Wong/ByteVideoLLM
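To illustrate the idea behind dynamic visual token compression, here is a minimal sketch in Python/NumPy. It is a hypothetical implementation, not the paper's actual operator: it assumes per-frame visual tokens arranged in a spatial grid and picks the smallest average-pooling stride per video so the total token count fits a fixed budget, so shorter videos keep more tokens per frame and longer videos are compressed more aggressively. The function name `compress_visual_tokens` and the `token_budget` parameter are illustrative.

```python
import math
import numpy as np

def compress_visual_tokens(tokens: np.ndarray, token_budget: int = 2048) -> np.ndarray:
    """Adaptively pool per-frame visual tokens so the total stays under a budget.

    tokens: array of shape (T, H, W, D) -- T frames, each an HxW grid of D-dim tokens.
    Returns an array of shape (T, H'*W', D) with T * H' * W' <= token_budget.

    Hypothetical sketch of dynamic token compression; the architecture in the
    paper may use a different compression operator.
    """
    T, H, W, D = tokens.shape
    per_frame = max(1, token_budget // T)  # token allowance per frame

    # Smallest pooling stride s such that the pooled grid fits the allowance.
    s = 1
    while math.ceil(H / s) * math.ceil(W / s) > per_frame:
        s += 1

    Hp, Wp = math.ceil(H / s), math.ceil(W / s)
    out = np.zeros((T, Hp, Wp, D), dtype=tokens.dtype)
    for i in range(Hp):
        for j in range(Wp):
            # Average-pool each s x s patch (edge patches may be smaller).
            patch = tokens[:, i * s:(i + 1) * s, j * s:(j + 1) * s, :]
            out[:, i, j, :] = patch.mean(axis=(1, 2))
    return out.reshape(T, Hp * Wp, D)
```

With a 2048-token budget, a 32-frame video with 24x24 token grids is pooled to 64 tokens per frame, while a 4-frame clip keeps 144 tokens per frame, capturing the length-dependent trade-off between temporal coverage and per-frame detail.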
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Long Video Understanding | LongVideoBench | Score | 70.1 | 248 |
| Long Video Understanding | MLVU | Score | 70.1 | 154 |
| Long Video Question Answering | MLVU | M-Avg | 70.1 | 39 |
| Long Video Understanding | Video-MME | Overall Score | 64.6 | 30 |
| Long Video Question Answering | Video-MME w/o subtitles | Accuracy | 0.646 | 14 |
| VideoQA | Video-MME | VQA Accuracy (Overall) | 64.6 | 13 |
| VideoQA | MLVU | Mean Score | 70.1 | 12 |