
Understanding Long Videos via LLM-Powered Entity Relation Graphs

About

The analysis of extended video content poses unique challenges in artificial intelligence, particularly when dealing with the complexity of tracking and understanding visual elements across time. Current methodologies that process video frames sequentially struggle to maintain coherent tracking of objects, especially when these objects temporarily vanish and later reappear in the footage. A critical limitation of these approaches is their inability to effectively identify crucial moments in the video, largely due to their limited grasp of temporal relationships. To overcome these obstacles, we present GraphVideoAgent, a cutting-edge system that leverages the power of graph-based object tracking in conjunction with large language model capabilities. At its core, our framework employs a dynamic graph structure that maps and monitors the evolving relationships between visual entities throughout the video sequence. This innovative approach enables more nuanced understanding of how objects interact and transform over time, facilitating improved frame selection through comprehensive contextual awareness. Our approach demonstrates remarkable effectiveness when tested against industry benchmarks. In evaluations on the EgoSchema dataset, GraphVideoAgent achieved a 2.2 improvement over existing methods while requiring analysis of only 8.2 frames on average. Similarly, testing on the NExT-QA benchmark yielded a 2.0 performance increase with an average frame requirement of 8.1. These results underscore the efficiency of our graph-guided methodology in enhancing both accuracy and computational performance in long-form video understanding tasks.
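To make the graph idea concrete, here is a minimal sketch of a dynamic entity graph that records which entities and relations appear in each frame, flags entities that vanish and later reappear, and picks a small set of frames that covers the entities a question asks about. It is an illustration only, not the authors' implementation: the class name `EntityGraph`, its methods, the greedy selection rule, and the toy detections are all assumptions introduced here.

```python
from collections import defaultdict


class EntityGraph:
    """Toy dynamic graph (hypothetical): entities are nodes, relations are edges."""

    def __init__(self):
        self.observed = set()                 # frame indices processed so far
        self.frames_of = defaultdict(set)     # entity -> frames where it is visible
        self.relations = defaultdict(set)     # (subject, relation, object) -> frames

    def update(self, frame_idx, entities, relations=()):
        """Record the entities and relations detected in one frame."""
        self.observed.add(frame_idx)
        for name in entities:
            self.frames_of[name].add(frame_idx)
        for subj, rel, obj in relations:
            self.relations[(subj, rel, obj)].add(frame_idx)

    def reappearances(self, entity):
        """Frames where an entity returns after being absent from the previous observed frame."""
        frames = self.frames_of.get(entity, set())
        result, seen_before, prev_visible = [], False, False
        for idx in sorted(self.observed):
            visible = idx in frames
            if visible and seen_before and not prev_visible:
                result.append(idx)
            seen_before = seen_before or visible
            prev_visible = visible
        return result

    def select_frames(self, query_entities, budget):
        """Greedily pick up to `budget` frames that cover the queried entities."""
        uncovered = {e for e in query_entities if e in self.frames_of}
        chosen = []
        while uncovered and len(chosen) < budget:
            # Frame showing the most still-uncovered entities; ties go to the earliest frame.
            best = max(
                sorted(self.observed),
                key=lambda idx: sum(idx in self.frames_of[e] for e in uncovered),
            )
            chosen.append(best)
            uncovered -= {e for e in uncovered if best in self.frames_of[e]}
        return sorted(chosen)


# Toy detections: the cup is visible in frame 0, missing in frame 1, back in frame 2.
graph = EntityGraph()
graph.update(0, ["person", "cup"], [("person", "holds", "cup")])
graph.update(1, ["person"])
graph.update(2, ["person", "cup", "sink"], [("cup", "in", "sink")])
graph.update(3, ["sink"])

print(graph.reappearances("cup"))
print(graph.select_frames(["cup", "sink"], budget=2))
```

In this toy run the cup is reported as reappearing at frame 2, and a question about the cup and the sink is answered from frame 2 alone, since that single frame shows both. A real system would obtain the per-frame entities and relations from a detector or a language model rather than from hand-written lists.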

Meng Chu, Yicong Li, Tat-Seng Chua • 2025

Related benchmarks

Task | Dataset | Result | Rank
Spatial Reasoning | VSI-Bench | -- | 24
Spatial Reasoning | STI-Bench | D-Measure Score: 30.7 | 18
Spatial Reasoning | Metro-Spatial-QA | Measurement Accuracy: 44.7 | 15
Spatial Reasoning (Video) | VSI-Bench | Accuracy: 58.6 | 14