Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

About

Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the "Mind Palace", which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB), to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.
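As a rough illustration of the three layers named in the abstract, the sketch below nests hand-object interactions inside activity zones inside a room layout, then serializes the result into plain text that an LLM could read. It is not the authors' implementation: the class names (`Interaction`, `ActivityZone`, `Room`), their fields, the `describe` function, and the sample kitchen data are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Interaction:
    """Layer (i): one hand-object interaction (hypothetical fields)."""
    obj: str
    action: str
    start_s: float
    end_s: float


@dataclass
class ActivityZone:
    """Layer (ii): an area where activity recurs (hypothetical)."""
    name: str
    interactions: list[Interaction] = field(default_factory=list)


@dataclass
class Room:
    """Layer (iii): a node in the environment layout (hypothetical)."""
    name: str
    zones: list[ActivityZone] = field(default_factory=list)
    connected_to: list[str] = field(default_factory=list)


def describe(rooms: list[Room]) -> str:
    """Serialize the graph top-down into indented plain text."""
    lines = []
    for room in rooms:
        neighbours = ", ".join(room.connected_to) or "none"
        lines.append(f"Room {room.name} (connected to: {neighbours})")
        for zone in room.zones:
            lines.append(f"  Zone {zone.name}")
            # List interactions in temporal order within each zone.
            for it in sorted(zone.interactions, key=lambda i: i.start_s):
                lines.append(
                    f"    [{it.start_s:.0f}s-{it.end_s:.0f}s] {it.action} {it.obj}"
                )
    return "\n".join(lines)


# Made-up sample data, not taken from any dataset.
kitchen = Room(
    name="kitchen",
    connected_to=["hallway"],
    zones=[
        ActivityZone(
            name="sink",
            interactions=[
                Interaction("cup", "wash", 120.0, 135.0),
                Interaction("tap", "turn on", 118.0, 119.0),
            ],
        )
    ],
)
text = describe([kitchen, Room(name="hallway", connected_to=["kitchen"])])
print(text)
```

Keeping the layout, zone, and interaction levels as separate nested types means a question about where something happened and a question about when it happened can both be answered from the same short text, rather than from raw frames.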

Zeyi Huang, Yuyang Ji, Xiaofang Wang, Nikhil Mehta, Tong Xiao, Donghyun Lee, Sigmund Vanvalkenburgh, Shengxin Zha, Bolin Lai, Yiqiu Ren, Licheng Yu, Ning Zhang, Yong Jae Lee, Miao Liu • 2025

Related benchmarks

Task                                       Dataset                   Result                     Rank
Video Question Answering                   EgoSchema                 Accuracy: 68.6             161
Video Question Answering                   IntentQA                  Accuracy (All): 70.1       35
Video Question Answering                   Active Memory Benchmark   Q1: 44.7                   8
Multiple-choice Video Question Answering   VMB                       Spatial Accuracy (S): 53   6
Open-ended Video Question Answering        VMB                       OE Score: 3.8              5
