PyraVid: Hierarchical Multimodal Memory for Long-Horizon Video Reasoning

About

Memory has become an increasingly important component of agentic systems, as these systems are expected to reason over long-term experience. However, prior work has largely focused on unimodal memory, leaving multimodal memory relatively underexplored despite its central role in real-world applications. Compared with unimodal settings, multimodal memory introduces additional challenges, including heterogeneous input integration, person-centric information alignment, and evidence aggregation across different granularities. We present PyraVid, a hierarchical multimodal memory framework inspired by Event Segmentation Theory from cognitive science. PyraVid organizes long videos into a coarse-to-fine pyramid structure, enabling structured memory access and effective evidence aggregation. It further supports structure-guided memory expansion with pruning, allowing the retrieval of related events with strong causal connectivity but low semantic similarity while reducing noise. Experiments on multiple long-video understanding benchmarks show that PyraVid consistently improves performance across datasets, model scales, and question types, highlighting the effectiveness of hierarchical multimodal memory for long-horizon reasoning.
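The coarse-to-fine access and the expansion-with-pruning step described above can be illustrated with a small sketch. This is not PyraVid's implementation: the `MemoryNode` class, the `retrieve` and `expand_with_pruning` functions, the relevance scores, and the link strengths are all hypothetical names and values chosen for illustration, and the abstract does not specify how relevance or causal connectivity is actually computed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MemoryNode:
    """One node of a hypothetical coarse-to-fine memory pyramid."""
    node_id: str
    summary: str
    score: float = 0.0  # assumed query relevance, computed elsewhere
    children: list[MemoryNode] = field(default_factory=list)
    # Assumed structural links to other nodes, each with a strength in [0, 1].
    links: list[tuple[MemoryNode, float]] = field(default_factory=list)


def retrieve(root: MemoryNode, top_k: int = 1) -> list[MemoryNode]:
    """Descend level by level, keeping the top_k children by score."""
    level, leaves = [root], []
    while level:
        next_level = []
        for node in level:
            if not node.children:
                leaves.append(node)
            else:
                ranked = sorted(node.children, key=lambda n: n.score, reverse=True)
                next_level.extend(ranked[:top_k])
        level = next_level
    return leaves


def expand_with_pruning(hits: list[MemoryNode], min_strength: float = 0.5) -> list[MemoryNode]:
    """Add linked nodes to the hits, pruning links weaker than min_strength."""
    seen = {n.node_id for n in hits}
    expanded = list(hits)
    for node in hits:
        for neighbor, strength in node.links:
            if strength >= min_strength and neighbor.node_id not in seen:
                seen.add(neighbor.node_id)
                expanded.append(neighbor)
    return expanded


# Toy pyramid: video -> scenes -> events (all values are made up).
pour = MemoryNode("e1", "person pours coffee", score=0.8)
wash = MemoryNode("e2", "person washes a cup", score=0.3)
unpack = MemoryNode("e3", "person unpacks groceries", score=0.05)
pour.links = [(unpack, 0.9), (wash, 0.1)]  # strong link to a low-score event
kitchen = MemoryNode("s1", "kitchen scene", score=0.9, children=[pour, wash])
garage = MemoryNode("s2", "garage scene", score=0.1, children=[unpack])
video = MemoryNode("v0", "whole video", children=[kitchen, garage])

hits = retrieve(video, top_k=1)
evidence = expand_with_pruning(hits, min_strength=0.5)
print([n.node_id for n in evidence])
```

In this toy example the event `e3` has a low relevance score and would be missed by the top-down descent alone, but it is recovered through its strong link to `e1`, while the weakly linked `e2` is pruned.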

Sikuan Yan, Sicheng Dong, Haotong Wang, Ercong Nie, Yilun Liu, Jinhe Bi, Yingjie Xu, Susanna Schwarzmann, Riccardo Trivisonno, Volker Tresp, Yunpu Ma • 2026

Related benchmarks

Task                      Dataset         Metric    Result  Rank
Long Video Understanding  LVBench         Accuracy  58.5    218
Long Video Understanding  Video-MME Long  Accuracy  69.1    92
Long Video Understanding  M3-Bench robot  MDR       47.8    8
Long Video Understanding  M3-Bench web    MDR       55.4    8
