High-resolution open-vocabulary object 6D pose estimation

About

Generalising to unseen objects is a major challenge in 6D pose estimation. While Vision-Language Models (VLMs) make it possible to use natural language descriptions to support 6D pose estimation of unseen objects, these solutions underperform compared to model-based methods. In this work we present Horyon, an open-vocabulary VLM-based architecture that addresses relative pose estimation between two scenes of an unseen object, described by a textual prompt only. We use the textual prompt to identify the unseen object in the scenes and then obtain high-resolution multi-scale features. These features are used to extract cross-scene matches for registration. We evaluate our model on a benchmark with a large variety of unseen objects across four datasets, namely REAL275, Toyota-Light, Linemod, and YCB-Video. Our method achieves state-of-the-art performance on all datasets, outperforming the previous best-performing approach by 12.6 in Average Recall.
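The final step mentioned in the abstract, registration from cross-scene matches, can be illustrated with a generic sketch. The abstract does not say which registration algorithm Horyon uses, so the example below substitutes the standard Kabsch (SVD-based) least-squares solution for a rigid transform between matched 3D points. It assumes the matches have already been lifted to 3D (for example with depth maps), and the function name `estimate_rigid_transform` is hypothetical, not taken from the authors' code.

```python
import numpy as np


def estimate_rigid_transform(src, dst):
    """Kabsch sketch: least-squares (R, t) with dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of matched 3D points, one row per match.
    Generic illustration only; not the registration used by the paper.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)

    # Centre both point sets on their centroids.
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)

    # 3x3 cross-covariance between the centred sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)

    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


if __name__ == "__main__":
    # Synthetic check: rotate and translate random points, then recover the pose.
    rng = np.random.default_rng(0)
    points_a = rng.normal(size=(50, 3))
    angle = np.pi / 6
    R_true = np.array([
        [np.cos(angle), -np.sin(angle), 0.0],
        [np.sin(angle), np.cos(angle), 0.0],
        [0.0, 0.0, 1.0],
    ])
    t_true = np.array([0.1, -0.2, 0.3])
    points_b = points_a @ R_true.T + t_true
    R_est, t_est = estimate_rigid_transform(points_a, points_b)
    print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))
```

With real matches, which contain outliers, a solver like this is typically wrapped in a robust loop such as RANSAC; whether the paper does so is not stated in the abstract.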

Jaime Corsetti, Davide Boscaini, Francesco Giuliari, Changjae Oh, Andrea Cavallaro, Fabio Poiesi • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| 6D Object Pose Estimation | Toyota-Light (TOYL) (test) | AR 33 | 18 |
| 6D Object Pose Estimation | REAL275 | ADD(-S) 51.6 | 11 |
| 6D Object Pose Estimation | REAL275 (test) | AR 57.9 | 8 |
| Relative Pose Estimation | Toyota-Light | ADD(-S) 25.1 | 7 |
| Relative Pose Estimation | YCB-Video | ADD(-S) 22.6 | 5 |
| Relative Pose Estimation | LineMOD | ADD(-S) 27.6 | 5 |
