Relationformer: A Unified Framework for Image-to-Graph Generation
About
A comprehensive representation of an image requires understanding objects and their mutual relationships, especially in image-to-graph generation tasks such as road network extraction, blood-vessel network extraction, and scene graph generation. Traditionally, image-to-graph generation is addressed with a two-stage approach consisting of object detection followed by separate relation prediction, which prevents simultaneous object-relation interaction. This work proposes a unified one-stage transformer-based framework, namely Relationformer, that jointly predicts objects and their relations. We leverage direct set-based object prediction and incorporate the interaction among the objects to learn an object-relation representation jointly. In addition to existing [obj]-tokens, we propose a novel learnable token, namely the [rln]-token. Together with the [obj]-tokens, the [rln]-token exploits local and global semantic reasoning in an image through a series of mutual associations. In combination with pair-wise [obj]-tokens, the [rln]-token contributes to computationally efficient relation prediction. We achieve state-of-the-art performance on multiple diverse, multi-domain datasets, demonstrating our approach's effectiveness and generalizability.
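The relation-prediction idea described above can be illustrated with a minimal sketch: each ordered pair of [obj]-tokens is combined with the shared [rln]-token and scored by a small MLP. All names, shapes, and weights below are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

# Hedged sketch of the relation head: for every ordered pair of detected
# objects, the pair of [obj]-tokens is concatenated with the shared
# [rln]-token and scored by a small MLP (dimensions are assumptions).
rng = np.random.default_rng(0)

num_obj, d = 4, 8          # number of detected objects, token dimension
num_rel_classes = 3        # e.g. binary connectivity would use 2 classes

obj_tokens = rng.standard_normal((num_obj, d))  # [obj]-tokens from decoder
rln_token = rng.standard_normal(d)              # single learned [rln]-token

# Randomly initialised MLP weights, one hidden layer (sketch only)
W1 = rng.standard_normal((3 * d, 16))
W2 = rng.standard_normal((16, num_rel_classes))

def relation_logits(i, j):
    """Score relation i -> j from the token triple [obj_i; obj_j; rln]."""
    pair = np.concatenate([obj_tokens[i], obj_tokens[j], rln_token])
    hidden = np.maximum(pair @ W1, 0.0)  # ReLU
    return hidden @ W2

# Score all ordered pairs -> shape (num_obj, num_obj, num_rel_classes)
logits = np.stack([[relation_logits(i, j) for j in range(num_obj)]
                   for i in range(num_obj)])
print(logits.shape)  # (4, 4, 3)
```

Because the single [rln]-token is shared across all pairs, the pairwise scoring stays cheap: only the two [obj]-tokens vary per pair.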
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scene Graph Generation | Visual Genome (test) | R@50 | 0.284 | 86 |
| Supervised Graph Prediction | QM9 (test) | Edit Distance | 3.8 | 7 |
| Supervised Graph Prediction | Toulouse (test) | Edit Distance | 0.13 | 4 |
| Graph-level Tasks | QM9 (test) | Inference Throughput (graphs/sec) | 10 | 4 |
| Supervised Graph Prediction | Coloring (test) | Edit Distance | 5.47 | 4 |
| Supervised Graph Prediction | GDB13 (test) | Edit Distance | 8.83 | 4 |
| Supervised Graph Prediction | USCities (test) | Edit Distance | 2.09 | 2 |