# Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs

## About
Dynamic scene graph generation from a video is challenging due to the temporal dynamics of the scene and the inherent temporal fluctuations of predictions. We hypothesize that capturing long-term temporal dependencies is the key to effective generation of dynamic scene graphs. We propose to learn the long-term dependencies in a video by capturing the object-level consistency and inter-object relationship dynamics over object-level long-term tracklets using transformers. Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods by a significant margin on the benchmark dataset Action Genome. Our ablation studies validate the effectiveness of each component of the proposed approach. The source code is available at https://github.com/Shengyu-Feng/DSG-DETR.
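The idea of modeling object-level consistency over long-term tracklets can be illustrated with a minimal sketch (not the authors' code; the class name, dimensions, and layer counts are illustrative assumptions): each object's per-frame features along its tracklet are fed through a transformer encoder, so the prediction at any frame can attend to the same object across the whole video.

```python
import torch
import torch.nn as nn

class TrackletEncoder(nn.Module):
    """Hypothetical sketch: self-attention over an object's tracklet frames."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tracklet_feats):
        # tracklet_feats: (num_tracklets, num_frames, dim) per-frame object features;
        # each output frame attends to all frames of the same tracklet.
        return self.encoder(tracklet_feats)

enc = TrackletEncoder()
x = torch.randn(3, 20, 256)  # 3 object tracklets, 20 frames each
out = enc(x)
print(out.shape)  # torch.Size([3, 20, 256])
```

The actual DSG-DETR architecture (see the linked repository) additionally models inter-object relationship dynamics; this sketch only shows the object-level temporal-attention component.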
## Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Graph Classification | Action Genome (test) | Recall@10 | 59.2 | 40 |
| Scene Graph Detection (SGDet) | Action Genome v1.0 (test) | Recall@10 | 32.1 | 32 |
| Scene Graph Detection | Action Genome | Recall@10 | 32.1 | 30 |
| SGCLS | Action Genome (test) | Recall@10 | 50.8 | 14 |
| SGDET | Action Genome (test) | Recall@10 | 30.3 | 14 |
| Video Scene Graph Classification (SGCLS) | Action Genome, Gaussian Noise corruption (Robust VidSGG, test) | mean Recall@10 | 11.4 | 8 |
| Video Scene Graph Classification (SGCLS) | Action Genome, Fog corruption (Robust VidSGG, test) | mean Recall@10 | 26.8 | 4 |
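The Recall@K numbers above measure the fraction of ground-truth relationship triplets recovered among a model's K highest-scoring predictions, averaged over test frames. A minimal single-frame illustration (the function and the example triplets are hypothetical, not the benchmark's evaluation code):

```python
def recall_at_k(pred_triplets, gt_triplets, k=10):
    """Fraction of ground-truth triplets found in the top-k predictions.

    pred_triplets: list of (score, (subject, predicate, object)) pairs
    gt_triplets:   set of (subject, predicate, object) ground-truth triplets
    """
    if not gt_triplets:
        return 0.0
    # Keep the k highest-scoring predicted triplets.
    top_k = {t for _, t in sorted(pred_triplets, key=lambda p: -p[0])[:k]}
    return len(top_k & set(gt_triplets)) / len(gt_triplets)

preds = [(0.9, ("person", "holding", "cup")),
         (0.8, ("person", "sitting_on", "chair")),
         (0.3, ("cup", "on", "table"))]
gt = {("person", "holding", "cup"), ("cup", "on", "table")}
print(recall_at_k(preds, gt, k=2))  # 0.5: one of two GT triplets in the top 2
```

Mean Recall@K (mR@K in some leaderboards) averages per-predicate recalls instead, so rare predicates weigh as much as frequent ones.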