HyperGLM: HyperGraph for Video Scene Graph Generation and Anticipation
About
Multimodal LLMs have advanced vision-language tasks but still struggle to understand video scenes. To bridge this gap, Video Scene Graph Generation (VidSGG) has emerged to capture multi-object relationships across video frames. However, prior methods rely on pairwise connections, limiting their ability to handle complex multi-object interactions and reasoning. To address this, we propose Multimodal LLMs on a Scene HyperGraph (HyperGLM), which promotes reasoning about multi-way interactions and higher-order relationships. Our approach uniquely integrates entity scene graphs, which capture spatial relationships between objects, with a procedural graph that models their causal transitions, forming a unified HyperGraph. Significantly, HyperGLM enables reasoning by injecting this unified HyperGraph into LLMs. Additionally, we introduce a new Video Scene Graph Reasoning (VSGR) dataset featuring 1.9M frames from third-person, egocentric, and drone views and supporting five tasks: Scene Graph Generation, Scene Graph Anticipation, Video Question Answering, Video Captioning, and Relation Reasoning. Empirically, HyperGLM consistently outperforms state-of-the-art methods across all five tasks, effectively modeling and reasoning about complex relationships in diverse video scenes.
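
Below is a minimal sketch of how the unified HyperGraph described above might be represented and serialized for injection into an LLM prompt. It is not the authors' implementation; all class, field, and method names (`SceneHyperGraph`, `to_prompt`, etc.) are hypothetical and shown only to illustrate the idea of combining pairwise spatial relations, multi-way hyperedges, and procedural (causal) transitions in one structure.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the unified HyperGraph; names are illustrative,
# not the authors' API.

@dataclass(frozen=True)
class Entity:
    """An object instance observed in a given frame."""
    frame: int
    label: str
    track_id: int

@dataclass
class SceneHyperGraph:
    # Entity scene graph: pairwise spatial relations within a frame,
    # e.g. (person, "holding", cup).
    spatial_relations: list = field(default_factory=list)
    # Hyperedges: multi-way interactions spanning two or more entities,
    # e.g. {person, knife, bread} -> "cutting".
    hyperedges: list = field(default_factory=list)
    # Procedural graph: causal transitions between relations across frames,
    # e.g. (person holding cup) at frame t -> (person drinking_from cup) at t+1.
    transitions: list = field(default_factory=list)

    def add_interaction(self, entities, predicate):
        """Register a multi-way interaction as a single hyperedge."""
        self.hyperedges.append((frozenset(entities), predicate))

    def to_prompt(self) -> str:
        """Serialize the HyperGraph into plain text so it can be injected
        into an LLM prompt for relationship reasoning or anticipation."""
        lines = []
        for s, p, o in self.spatial_relations:
            lines.append(f"[frame {s.frame}] {s.label} --{p}--> {o.label}")
        for ents, p in self.hyperedges:
            names = ", ".join(sorted(e.label for e in ents))
            lines.append(f"[interaction] {{{names}}}: {p}")
        for (s1, p1, o1), (s2, p2, o2) in self.transitions:
            lines.append(
                f"[transition] ({s1.label} {p1} {o1.label}) -> ({s2.label} {p2} {o2.label})"
            )
        return "\n".join(lines)
```

In this sketch, a hyperedge groups any number of co-interacting objects into one relation instead of decomposing the interaction into pairwise edges, and the serialized text from `to_prompt` stands in for whatever injection mechanism the paper actually uses.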
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Graph Anticipation | Action Genome (test) | R@10 | 38.8 | 8 |
| Scene Graph Anticipation | VSGR (test) | R@10 | 30.2 | 8 |
| Scene Graph Generation | PVSG (test) | R@20 | 7.5 | 5 |
| Scene Graph Generation | VSGR (test) | R@20 | 35.8 | 5 |
| Video Question Answering | VSGR | Accuracy | 45.4 | 5 |
| Relation Reasoning | VSGR | Accuracy | 47.2 | 4 |
| Video Captioning | VSGR | CIDEr | 54.5 | 4 |