Target Adaptive Context Aggregation for Video Scene Graph Generation
About
This paper addresses the challenging task of video scene graph generation (VidSGG), which can serve as a structured video representation for high-level understanding tasks. We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking. Specifically, we design an efficient method for frame-level VidSGG, termed Target Adaptive Context Aggregation Network (TRACE), with a focus on capturing spatio-temporal context information for relation recognition. Our TRACE framework streamlines the VidSGG pipeline with a modular design and presents two unique blocks: Hierarchical Relation Tree (HRTree) construction and Target-adaptive Context Aggregation. More specifically, our HRTree first provides an adaptive structure for organizing possible relation candidates efficiently, and guides the context aggregation module to effectively capture spatio-temporal structure information. Then, we obtain a contextualized feature representation for each relation candidate and build a classification head to recognize its relation category. Finally, we provide a simple temporal association strategy to track the results detected by TRACE and yield the video-level VidSGG. We perform experiments on two VidSGG benchmarks, ImageNet-VidVRD and Action Genome, and the results demonstrate that TRACE achieves state-of-the-art performance. The code and models are available at https://github.com/MCG-NJU/TRACE.
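The abstract mentions a temporal association step that links frame-level relation detections into video-level results, but does not spell it out. The following is a minimal illustrative sketch of one common way to do such linking (greedy matching between adjacent frames by triplet label and box IoU). It is not the authors' implementation; the names `RelationDet`, `RelationTrack`, `associate`, and the `iou_thr` parameter are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class RelationDet:
    """One frame-level relation detection (hypothetical structure)."""
    triplet: Tuple[str, str, str]  # (subject, predicate, object)
    subj_box: Box
    obj_box: Box
    score: float


@dataclass
class RelationTrack:
    """A run of linked detections over consecutive frames."""
    triplet: Tuple[str, str, str]
    start: int  # index of the first frame
    dets: List[RelationDet] = field(default_factory=list)

    @property
    def end(self) -> int:
        return self.start + len(self.dets) - 1


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def associate(frames: List[List[RelationDet]],
              iou_thr: float = 0.5) -> List[RelationTrack]:
    """Greedily link detections in adjacent frames into tracks.

    A detection extends an active track when the triplet labels match
    and both subject and object boxes overlap by at least iou_thr
    (an assumed threshold). Unmatched tracks are closed.
    """
    finished: List[RelationTrack] = []
    active: List[RelationTrack] = []
    for t, dets in enumerate(frames):
        next_active: List[RelationTrack] = []
        used = set()  # indices of active tracks already extended
        # Higher-scoring detections pick their track first.
        for det in sorted(dets, key=lambda d: -d.score):
            best, best_iou = None, iou_thr
            for i, track in enumerate(active):
                if i in used or track.triplet != det.triplet:
                    continue
                last = track.dets[-1]
                overlap = min(iou(last.subj_box, det.subj_box),
                              iou(last.obj_box, det.obj_box))
                if overlap >= best_iou:
                    best, best_iou = i, overlap
            if best is None:
                next_active.append(RelationTrack(det.triplet, t, [det]))
            else:
                used.add(best)
                active[best].dets.append(det)
                next_active.append(active[best])
        finished.extend(tr for i, tr in enumerate(active) if i not in used)
        active = next_active
    return finished + active


# Usage: three frames, the first two share one relation.
frames = [
    [RelationDet(("person", "ride", "bike"), (0, 0, 10, 10), (5, 5, 15, 15), 0.9)],
    [RelationDet(("person", "ride", "bike"), (1, 0, 11, 10), (5, 5, 15, 15), 0.8)],
    [RelationDet(("dog", "chase", "cat"), (0, 0, 10, 10), (20, 20, 30, 30), 0.7)],
]
tracks = associate(frames)
```

Greedy frame-to-frame linking is chosen here only because it is short; it does not bridge gaps where a relation is missed in a frame, which a practical system would likely need to handle.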
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Relation Detection | VRD (test) | R@50: 9.08 | 75 |
| PredCLS | Action Genome (test) | Recall@10: 72.6 | 54 |
| Scene Graph Classification | Action Genome (test) | Recall@10: 37.1 | 40 |
| Scene Graph Detection (SGDet) | Action Genome v1.0 (test) | R@10: 26.5 | 32 |
| Scene Graph Detection | Action Genome | Recall@10: 26.5 | 30 |
| Predicate Classification | Action Genome | Recall@10: 72.6 | 26 |
| Relation Tagging | VidVRD v1.0 (test) | P@5: 45.3 | 18 |
| Relation Detection | VidVRD v1.0 (test) | R@50: 9.08 | 18 |
| Relation Tagging | VidVRD (test) | P@1: 61 | 14 |
| SGCLS | Action Genome (test) | Recall@10: 14.8 | 14 |
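
The results in the table are top-K ranking metrics (Recall@K and Precision@K). As a rough sketch of how such metrics are computed, the code below scores a ranked list of predicted relation triplets against a ground-truth set using label matching only. This is a simplification and an assumption of this example: the actual benchmark protocols also require the predicted boxes or trajectories to overlap the ground truth, and the function names `recall_at_k` and `precision_at_k` are hypothetical.

```python
from typing import Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, predicate, object)


def recall_at_k(ranked_preds: List[Triplet],
                gt: Iterable[Triplet], k: int) -> float:
    """Fraction of ground-truth triplets found among the top-k predictions."""
    gt_set = set(gt)
    if not gt_set:
        return 0.0
    hits = set(ranked_preds[:k]) & gt_set
    return len(hits) / len(gt_set)


def precision_at_k(ranked_preds: List[Triplet],
                   gt: Iterable[Triplet], k: int) -> float:
    """Fraction of the top-k predictions that are in the ground truth."""
    if k <= 0:
        return 0.0
    gt_set = set(gt)
    correct = sum(1 for p in ranked_preds[:k] if p in gt_set)
    return correct / k


# Usage: predictions are assumed to be sorted by descending confidence.
preds = [
    ("person", "ride", "bike"),
    ("person", "hold", "cup"),
    ("dog", "chase", "cat"),
    ("cat", "sit_on", "sofa"),
]
truth = [
    ("person", "ride", "bike"),
    ("dog", "chase", "cat"),
    ("person", "watch", "tv"),
]
r2 = recall_at_k(preds, truth, 2)
r4 = recall_at_k(preds, truth, 4)
p1 = precision_at_k(preds, truth, 1)
p2 = precision_at_k(preds, truth, 2)
```

Leaderboards typically report these values as percentages averaged over all test videos or frames.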