VideoARM: Agentic Reasoning over Hierarchical Memory for Long-Form Video Understanding

About

Long-form video understanding remains challenging because of its extended temporal structure and dense multimodal cues. Despite recent progress, many existing approaches still rely on hand-crafted reasoning pipelines or on token-consuming video preprocessing to guide MLLMs in autonomous reasoning. To overcome these limitations, we introduce VideoARM, an Agentic Reasoning-over-hierarchical-Memory paradigm for long-form video understanding. Instead of static, exhaustive preprocessing, VideoARM reasons and builds its memory adaptively, on the fly. Specifically, it runs a continuous, adaptive loop of observing, thinking, acting, and memorizing, in which a controller autonomously invokes tools to interpret the video in a coarse-to-fine manner, thereby substantially reducing token consumption. In parallel, a hierarchical multimodal memory continuously captures and updates multi-level clues throughout the agent's operation, providing precise contextual information to support the controller's decision-making. Experiments on prevalent benchmarks demonstrate that VideoARM outperforms the state-of-the-art method, DVD, while significantly reducing token consumption for long-form videos.
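
The observe, think, act, memorize loop described above can be illustrated with a small toy sketch. This is not the authors' implementation: the memory levels, the two tools (`coarse_scan`, `fine_inspect`), the rule-based `controller` standing in for an MLLM, and the use of caption strings as "frames" are all assumptions made for illustration. The number of frames read serves as a rough proxy for token consumption.

```python
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    """Toy multi-level store; the level names are assumptions, not the paper's."""
    coarse: list = field(default_factory=list)   # candidate frames from sparse sampling
    fine: list = field(default_factory=list)     # confirmed frames from dense inspection
    actions: list = field(default_factory=list)  # history of tool calls


def coarse_scan(video, query, stride):
    """Hypothetical tool: look only at every `stride`-th frame."""
    seen = list(range(0, len(video), stride))
    return [i for i in seen if query in video[i]], len(seen)


def fine_inspect(video, query, center, radius):
    """Hypothetical tool: look at every frame in a small window around `center`."""
    seen = list(range(max(0, center - radius), min(len(video), center + radius + 1)))
    return [i for i in seen if query in video[i]], len(seen)


def controller(memory):
    """Rule-based stand-in for the MLLM controller: choose the next action from memory."""
    if "coarse_scan" not in memory.actions:
        return "coarse_scan"
    if memory.coarse and "fine_inspect" not in memory.actions:
        return "fine_inspect"
    return "answer"


def run_agent(video, query, stride=4, radius=2, max_steps=8):
    memory = HierarchicalMemory()
    frames_read = 0  # rough proxy for token consumption
    for _ in range(max_steps):
        action = controller(memory)          # observe memory, then think
        if action == "answer":
            break
        if action == "coarse_scan":          # act: sparse pass over the whole video
            hits, cost = coarse_scan(video, query, stride)
            memory.coarse.extend(hits)       # memorize coarse clues
        else:                                # act: dense pass around the first candidate
            hits, cost = fine_inspect(video, query, memory.coarse[0], radius)
            memory.fine.extend(hits)         # memorize fine clues
        memory.actions.append(action)
        frames_read += cost
    return memory.fine or memory.coarse, frames_read


# Each "frame" is a caption string standing in for visual content.
video = ["street"] * 20
video[11] = video[12] = video[13] = "street dog"
frames, cost = run_agent(video, "dog")
print(frames, cost)  # [11, 12, 13] 10
```

In this toy run the agent localizes the event after reading 10 frames instead of all 20. A fixed-stride coarse pass can miss events shorter than the stride, so a real controller would presumably need to adapt its sampling; that behavior is not modeled here.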

Yufei Yin, Qianke Meng, Minghao Chen, Jiajun Ding, Zhenwei Shao, Zhou Yu • 2025

Related benchmarks

Task	Dataset	Result
Long Video Understanding	LongVideoBench (val)	--	282
Long-form Video Understanding	LVBench	Overall Score 79.7	77
Long Video Understanding	EgoSchema (val)	Accuracy 76.2	39
Video Question Answering	Video-MME no subs standard Long	Accuracy 81.2	29
Long-form Egocentric Video Understanding	EgoSchema	Accuracy 78.2	25
Temporal Grounding	CoMET-Bench	mIoU 13.5	21
Negative Query Recognition	CoMET-Bench	Rej.-F1 61	21
Counting	CoMET-Bench	MAE 3.6	21
Video Question Answering	LongVideoBench (standard)	Accuracy 76.4	19
Long Video Understanding	Video-MME w/o sub (full)	Score (Long) 81.2	13
