Hierarchical Open-Vocabulary 3D Scene Graphs for Language-Grounded Robot Navigation
About
Recent open-vocabulary robot mapping methods enrich dense geometric maps with pre-trained visual-language features. While these maps allow for the prediction of point-wise saliency maps when queried for a certain language concept, large-scale environments and abstract queries beyond the object level still pose a considerable hurdle, ultimately limiting language-grounded robotic navigation. In this work, we present HOV-SG, a hierarchical open-vocabulary 3D scene graph mapping approach for language-grounded robot navigation. Leveraging open-vocabulary vision foundation models, we first obtain state-of-the-art open-vocabulary segment-level maps in 3D and subsequently construct a 3D scene graph hierarchy consisting of floor, room, and object concepts, each enriched with open-vocabulary features. Our approach is able to represent multi-story buildings and allows robotic traversal of those using a cross-floor Voronoi graph. HOV-SG is evaluated on three distinct datasets and surpasses previous baselines in open-vocabulary semantic accuracy on the object, room, and floor level while producing a 75% reduction in representation size compared to dense open-vocabulary maps. In order to prove the efficacy and generalization capabilities of HOV-SG, we showcase successful long-horizon language-conditioned robot navigation within real-world multi-story environments. We provide code and trial video data at http://hovsg.github.io/.
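As a rough illustration of how a floor, room, and object hierarchy with per-node features might be queried, the sketch below descends the hierarchy one level at a time and picks the child whose feature is most similar to the query for that level. It is not the HOV-SG implementation: the names `Node`, `cosine`, `best_child`, and `resolve` are hypothetical, and the three-dimensional toy vectors stand in for the open-vocabulary embeddings that a vision-language model would provide.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import List


@dataclass
class Node:
    """One node of a toy scene graph (floor, room, or object)."""
    name: str
    feature: List[float]  # stand-in for an open-vocabulary embedding
    children: List["Node"] = field(default_factory=list)


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_child(node: Node, query: List[float]) -> Node:
    """Return the child whose feature is most similar to the query."""
    return max(node.children, key=lambda c: cosine(c.feature, query))


def resolve(root: Node, queries: List[List[float]]) -> List[str]:
    """Descend one level per query (floor, then room, then object)."""
    path = []
    node = root
    for q in queries:
        node = best_child(node, q)
        path.append(node.name)
    return path


# Toy building with hand-picked features (assumed values, for illustration only).
kitchen = Node("kitchen", [1.0, 0.0, 0.0], [
    Node("mug", [0.9, 0.1, 0.0]),
    Node("sink", [0.1, 0.9, 0.0]),
])
office = Node("office", [0.0, 1.0, 0.0], [
    Node("chair", [1.0, 0.0, 0.0]),
    Node("desk", [0.0, 1.0, 0.0]),
])
building = Node("building", [], [
    Node("floor_0", [1.0, 0.0, 0.0]),
    Node("floor_1", [0.0, 1.0, 0.0], [kitchen, office]),
])

# One query vector per hierarchy level: floor, room, object.
print(resolve(building, [[0.1, 0.9, 0.0], [0.8, 0.2, 0.0], [0.2, 0.8, 0.0]]))
```

Resolving the query level by level means each similarity search only covers the children of the node chosen at the previous level, rather than every object in the building.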
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 3D Semantic Segmentation | ScanNet (test) | mIoU 20.76 | 109 |
| 3D Semantic Segmentation | ScanNet | mIoU 27.8 | 51 |
| 3D Semantic Segmentation | Replica | 3D mIoU 23.1 | 41 |
| 3D Semantic Mapping | Replica | mAcc 39.59 | 25 |
| 3D Semantic Segmentation | ScanNet 3 (val) | mIoU 34.4 | 11 |
| 3D Semantic Segmentation | ScanNet200 42 (val) | mIoU 11.2 | 9 |
| Open-Vocabulary 3D Semantic Segmentation | Replica | mAcc 38.07 | 8 |
| Spatial Question Response (Object Retrieval) | HM3DSem-SQR | Accuracy (1m, ABC) 27 | 7 |
| Open-Vocabulary 3D Semantic Segmentation | Replica (test) | All IoU 22.5 | 7 |
| 3D Object Localization | GOAT-Core Scene Nfv | Average SR@5 82.4 | 6 |