# Exploiting Long-Term Dependencies for Generating Dynamic Scene Graphs

## About
Dynamic scene graph generation from a video is challenging due to the temporal dynamics of the scene and the inherent temporal fluctuations of predictions. We hypothesize that capturing long-term temporal dependencies is the key to effective generation of dynamic scene graphs. We propose to learn the long-term dependencies in a video by capturing the object-level consistency and inter-object relationship dynamics over object-level long-term tracklets using transformers. Experimental results demonstrate that our Dynamic Scene Graph Detection Transformer (DSG-DETR) outperforms state-of-the-art methods by a significant margin on the benchmark dataset Action Genome. Our ablation studies validate the effectiveness of each component of the proposed approach. The source code is available at https://github.com/Shengyu-Feng/DSG-DETR.
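The idea of modeling object-level consistency over long-term tracklets can be illustrated with a minimal sketch (not the authors' code; the class name, dimensions, and layer counts are illustrative assumptions): each object's per-frame features along its tracklet are fed through a transformer encoder, so the prediction at any frame can attend to the same object across the whole video.

```python
import torch
import torch.nn as nn

class TrackletEncoder(nn.Module):
    """Hypothetical sketch: self-attention over an object's tracklet frames."""

    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tracklet_feats):
        # tracklet_feats: (num_tracklets, num_frames, dim) per-frame object features;
        # each output frame attends to all frames of the same tracklet.
        return self.encoder(tracklet_feats)

enc = TrackletEncoder()
x = torch.randn(3, 20, 256)  # 3 object tracklets, 20 frames each
out = enc(x)
print(out.shape)  # torch.Size([3, 20, 256])
```

The actual DSG-DETR architecture (see the linked repository) additionally models inter-object relationship dynamics; this sketch only shows the object-level temporal-attention component.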
## Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Graph Classification | Action Genome (test) | Recall@10 | 59.2 | 40 |
| Scene Graph Detection (SGDet) | Action Genome v1.0 (test) | Recall@10 | 32.1 | 32 |
| Scene Graph Detection | Action Genome | Recall@10 | 32.1 | 30 |
| SGCLS | Action Genome (test) | Recall@10 | 50.8 | 14 |
| SGDET | Action Genome (test) | Recall@10 | 30.3 | 14 |
| Video Scene Graph Classification (SGCLS) | Action Genome, Gaussian Noise corruption (Robust VidSGG, test) | mean Recall@10 | 11.4 | 8 |
| Video Scene Graph Classification (SGCLS) | Action Genome, Fog corruption (Robust VidSGG, test) | mean Recall@10 | 26.8 | 4 |
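The Recall@K numbers above measure the fraction of ground-truth relationship triplets recovered among a model's K highest-scoring predictions, averaged over test frames. A minimal single-frame illustration (the function and the example triplets are hypothetical, not the benchmark's evaluation code):

```python
def recall_at_k(pred_triplets, gt_triplets, k=10):
    """Fraction of ground-truth triplets found in the top-k predictions.

    pred_triplets: list of (score, (subject, predicate, object)) pairs
    gt_triplets:   set of (subject, predicate, object) ground-truth triplets
    """
    if not gt_triplets:
        return 0.0
    # Keep the k highest-scoring predicted triplets.
    top_k = {t for _, t in sorted(pred_triplets, key=lambda p: -p[0])[:k]}
    return len(top_k & set(gt_triplets)) / len(gt_triplets)

preds = [(0.9, ("person", "holding", "cup")),
         (0.8, ("person", "sitting_on", "chair")),
         (0.3, ("cup", "on", "table"))]
gt = {("person", "holding", "cup"), ("cup", "on", "table")}
print(recall_at_k(preds, gt, k=2))  # 0.5: one of two GT triplets in the top 2
```

Mean Recall@K (mR@K in some leaderboards) averages per-predicate recalls instead, so rare predicates weigh as much as frequent ones.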