PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

About

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.

Sikuan Yan, Sicheng Dong, Haotong Wang, Ercong Nie, Yilun Liu, Jinhe Bi, Yingjie Xu, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma • 2026
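
The abstract describes two mechanisms: coarse-to-fine access over a pyramid, and structure-guided expansion with pruning. The toy sketch below illustrates both ideas under loud assumptions and is not the authors' implementation. The `Node` class, the keyword-overlap `similarity` function (a stand-in for a real embedding similarity), the `links` dictionary of link strengths, and the `min_strength` threshold are all hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One memory entry. All fields are illustrative assumptions."""
    name: str
    keywords: set
    children: list = field(default_factory=list)
    links: dict = field(default_factory=dict)  # linked node name -> link strength


def similarity(query, node):
    """Jaccard keyword overlap, standing in for an embedding similarity."""
    union = query | node.keywords
    return len(query & node.keywords) / len(union) if union else 0.0


def retrieve(root, query, top_k=1):
    """Descend the pyramid coarse-to-fine, keeping top_k children per node."""
    frontier, hits = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:  # finest level reached
                hits.append(node)
                continue
            ranked = sorted(node.children,
                            key=lambda c: similarity(query, c), reverse=True)
            next_frontier.extend(ranked[:top_k])
        frontier = next_frontier
    return hits


def expand(hits, index, min_strength=0.5):
    """Add linked nodes; prune links weaker than min_strength."""
    selected = {n.name for n in hits}
    expanded = list(hits)
    for node in hits:
        for name, strength in node.links.items():
            if strength >= min_strength and name not in selected:
                selected.add(name)
                expanded.append(index[name])
    return expanded


# Toy pyramid: video -> events -> clips.
fry = Node("fry_eggs", {"cook", "pan", "eggs"},
           links={"buy_eggs": 0.9, "wash_sink": 0.2})
buy = Node("buy_eggs", {"store", "buy", "carton"})
wash = Node("wash_sink", {"sink", "wash"})
cooking = Node("cooking", {"kitchen", "cook", "pan"}, children=[fry])
shopping = Node("shopping", {"store", "buy"}, children=[buy])
cleaning = Node("cleaning", {"sink", "wash"}, children=[wash])
video = Node("video", set(), children=[cooking, shopping, cleaning])
index = {n.name: n for n in (fry, buy, wash)}

query = {"cook", "pan"}
hits = retrieve(video, query)
memory = expand(hits, index)
print([n.name for n in memory])  # ['fry_eggs', 'buy_eggs']
```

In this toy setup, `buy_eggs` shares no keywords with the query, so similarity search alone would not return it. It is pulled in only through its strong link to `fry_eggs`, while the weak link to `wash_sink` is pruned.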

Related benchmarks

Task	Dataset	Metric	Result
Long Video Understanding	LVBench	Accuracy	58.5	267
Long Video Understanding	Video-MME Long	Accuracy	69.1	120
Long Video Understanding	M3-Bench robot	MDR	47.8	8
Long Video Understanding	M3-Bench web	MDR	55.4	8
