ConceptPose: Training-Free Zero-Shot Object Pose Estimation using Concept Vectors
About
Object pose estimation is a fundamental task in computer vision and robotics, yet most methods require extensive, dataset-specific training. Concurrently, large-scale vision-language models show remarkable zero-shot capabilities. In this work, we bridge these two worlds by introducing ConceptPose, a framework for object pose estimation that is both training-free and model-free. ConceptPose leverages a vision-language model (VLM) to create open-vocabulary 3D concept maps, in which each point is tagged with a concept vector derived from saliency maps. By establishing robust 3D-3D correspondences across these concept maps, our approach enables precise estimation of the 6DoF relative pose. Without any object- or dataset-specific training, our approach achieves state-of-the-art results on common zero-shot relative pose estimation benchmarks, outperforming existing methods, including those that rely on extensive dataset-specific training, by over 62% in ADD(-S) score.
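To make the correspondence-and-alignment step concrete, here is a minimal NumPy sketch. It assumes each 3D point already carries a concept vector (the VLM/saliency extraction is not shown), matches points by mutual-nearest-neighbour cosine similarity, and recovers the 6DoF pose with a RANSAC-wrapped Kabsch solver. The function names and the robust-estimation details are illustrative assumptions, not the paper's exact method.

```python
# Sketch of concept-based correspondence + rigid alignment (assumed details).
import numpy as np

def match_by_concept(feats_a, feats_b):
    """Mutual nearest neighbours in concept-vector space (cosine similarity)."""
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T
    nn_ab = sim.argmax(axis=1)                  # best match in B for each point of A
    nn_ba = sim.argmax(axis=0)                  # best match in A for each point of B
    mutual = nn_ba[nn_ab] == np.arange(len(a))  # keep only mutual matches
    idx_a = np.where(mutual)[0]
    return idx_a, nn_ab[idx_a]

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) aligning src to dst."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, mu_d - R @ mu_s

def ransac_pose(pts_a, pts_b, iters=500, thresh=0.01, seed=0):
    """RANSAC over minimal 3-point samples; returns the pose with most inliers."""
    rng = np.random.default_rng(seed)
    best_R, best_t, best_inl = np.eye(3), np.zeros(3), 0
    for _ in range(iters):
        s = rng.choice(len(pts_a), size=3, replace=False)
        R, t = kabsch(pts_a[s], pts_b[s])
        inl = (np.linalg.norm(pts_a @ R.T + t - pts_b, axis=1) < thresh).sum()
        if inl > best_inl:
            best_R, best_t, best_inl = R, t, inl
    return best_R, best_t
```

For example, given point clouds `pa`, `pb` and per-point concept features `fa`, `fb`, the relative pose would follow from `idx_a, idx_b = match_by_concept(fa, fb)` and then `R, t = ransac_pose(pa[idx_a], pb[idx_b])`.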
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 6D Object Pose Estimation | REAL275 | ADD(-S) | 71.5 | 11 |
| Relative Pose Estimation | Toyota-Light | ADD(-S) | 55 | 7 |
| Relative Pose Estimation | YCB-Video | ADD(-S) | 41.2 | 5 |
| Relative Pose Estimation | LineMOD | ADD(-S) | 38.6 | 5 |
| Pose Tracking | YCB-V Few-shot tracking | ADD-AUC | 90.1 | 3 |