VCA: Video Curious Agent for Long Video Understanding

About

Long video understanding poses unique challenges due to its temporal complexity and low information density. Recent works address this task by sampling numerous frames or incorporating auxiliary tools using LLMs, both of which result in high computational costs. In this work, we introduce a curiosity-driven video agent with self-exploration capability, dubbed VCA. Built upon VLMs, VCA autonomously navigates video segments and efficiently builds a comprehensive understanding of complex video sequences. Instead of directly sampling frames, VCA employs a tree-search structure to explore video segments and collect frames. Rather than relying on external feedback or reward, VCA leverages the VLM's self-generated intrinsic reward to guide its exploration, enabling it to capture the most crucial information for reasoning. Experimental results on multiple long video benchmarks demonstrate our approach's superior effectiveness and efficiency.

Zeyuan Yang, Delin Chen, Xueyang Yu, Maohao Shen, Chuang Gan • 2024
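
The abstract describes exploring video segments with a tree search guided by a self-generated reward. The sketch below is only an illustration of that general idea, not the authors' implementation: it runs a best-first search over a tree of time segments, expanding the highest-scoring segment first and collecting one frame timestamp per expanded segment. The names explore_segments, score_fn, branching, budget, and min_len are hypothetical, and score_fn is a stand-in for whatever reward a VLM would assign to a segment.

```python
import heapq
from typing import Callable, List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); hypothetical representation


def explore_segments(
    duration: float,
    score_fn: Callable[[Segment], float],
    branching: int = 4,
    budget: int = 8,
    min_len: float = 1.0,
) -> List[float]:
    """Best-first search over a tree of segments.

    Returns the sorted timestamps of the frames collected along the way.
    score_fn is an assumed placeholder for a model-generated reward.
    """
    # heapq is a min-heap, so scores are negated to pop the best segment first.
    frontier = [(-score_fn((0.0, duration)), 0.0, duration)]
    frames: List[float] = []

    while frontier and len(frames) < budget:
        _, start, end = heapq.heappop(frontier)
        # Collect the midpoint frame of the segment being expanded.
        frames.append((start + end) / 2)

        # Segments at or below min_len are leaves and are not split further.
        if end - start <= min_len:
            continue

        # Split the segment into equal children and score each one.
        step = (end - start) / branching
        for i in range(branching):
            child = (start + i * step, start + (i + 1) * step)
            heapq.heappush(frontier, (-score_fn(child), child[0], child[1]))

    return sorted(frames)


def toy_score(segment: Segment) -> float:
    """Toy reward: segments centred near t = 70 s score higher."""
    midpoint = (segment[0] + segment[1]) / 2
    return -abs(midpoint - 70.0)


if __name__ == "__main__":
    # Explore a 100-second video with a budget of five frames.
    print(explore_segments(100.0, toy_score, budget=5))
```

With the toy reward, the search concentrates its frame budget around the 70-second mark instead of spreading frames uniformly, which is the efficiency argument the abstract makes in general terms.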

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LVBench	Accuracy 41.3	267
Video Understanding	EgoSchema	--	185
Long-form Video Understanding	LongVideoBench	Accuracy 41.3	135
Long-form Video Understanding	LVBench	Overall Score 41.3	77
Video Question Answering	Video-MME Long	Accuracy 56.3	71
Long Video Understanding	EgoSchema (val)	Accuracy 73.6	39
Video Question Answering	LVBench	Overall Score 41.3	38
Motion Understanding	MotionBench	Accuracy 51.3	35
Video Question Answering	Video-MME Long Duration 1.0	Accuracy (w/o subtitles) 56.3	34
Long Video Question Answering	LVBench	All Score 41.3	31

Showing 10 of 20 rows
