DynFocus: Dynamic Cooperative Network Empowers LLMs with Video Understanding

About

The challenge in LLM-based video understanding lies in preserving visual and semantic information in long videos while maintaining a memory-affordable token count. However, redundancy and correspondence in videos have hindered the performance potential of existing methods. Through statistical learning on current datasets, we observe that redundancy occurs in both repeated and answer-irrelevant frames, and that the corresponding frames vary with different questions. This suggests the possibility of adopting dynamic encoding to balance detailed video information preservation with token budget reduction. To this end, we propose a dynamic cooperative network, DynFocus, for memory-efficient video encoding. Specifically, it consists of (i) a Dynamic Event Prototype Estimation (DPE) module that dynamically selects meaningful frames for question answering, and (ii) a Compact Cooperative Encoding (CCE) module that encodes meaningful frames with detailed visual appearance and the remaining frames with sketchy perception separately. We evaluate our method on five publicly available benchmarks, and experimental results consistently demonstrate that our method achieves competitive performance.
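The idea of spending more tokens on question-relevant frames and fewer on the rest can be illustrated with a toy sketch. This is not the DynFocus implementation: the abstract does not describe how DPE scores frames or how CCE compresses them, so the sketch assumes cosine similarity to a question embedding for selection and mean pooling for the coarse encoding. All names (`cosine`, `mean_pool`, `encode_video`, `k`) are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_pool(tokens):
    """Average a list of token vectors into a single vector."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def encode_video(frames, question, k):
    """Toy dynamic encoding (assumed scheme, not the paper's modules).

    frames:   list of frames, each a list of token vectors
    question: a single embedding vector for the question
    k:        number of frames to keep at full token resolution

    Returns (tokens, selected), where `tokens` is the flattened token
    sequence and `selected` lists the indices of the detailed frames.
    """
    # Assumption: relevance = similarity of the pooled frame to the question.
    scores = [cosine(mean_pool(frame), question) for frame in frames]
    ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
    selected = set(ranked[:k])

    tokens = []
    for i, frame in enumerate(frames):
        if i in selected:
            tokens.extend(frame)             # detailed: keep every token
        else:
            tokens.append(mean_pool(frame))  # sketchy: one pooled token
    return tokens, sorted(selected)
```

Under these assumptions, a video with N frames of T tokens each is reduced from N·T tokens to k·T + (N − k). For example, 3 frames of 2 tokens each with k = 1 yield 4 tokens instead of 6.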

Yudong Han, Qingpei Guo, Liyuan Pan, Liu Liu, Yu Guan, Ming Yang • 2024

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LVBench	Accuracy 31.5	218
Video Question Answering	LongVideoBench	Accuracy 31.8	210
Long Video Understanding	MLVU	--	205
Video Question Answering	MLVU	Accuracy 49.6	194
Video Understanding	Video-MME without subtitles	Overall Score 44.1	108
Long Video Understanding	MLVU (dev)	--	63
Video Question Answering	MSVD-QA zero-shot (test)	Accuracy 74.8	56
Video Question Answering	MSRVTT-QA zero-shot (test)	Accuracy 62.8	55
Video Question Answering	ActivityNet-QA zero-shot (test)	Accuracy 50.3	55
Video Question Answering	Video-MME	Accuracy (Average, w/o Subtitle) 47.8	48

Showing 10 of 17 rows
