Learning Canonical Shape Space for Category-Level 6D Object Pose and Size Estimation
About
We present a novel approach to category-level 6D object pose and size estimation. To tackle intra-class shape variations, we learn a canonical shape space (CASS), a unified representation for a large variety of instances of a certain object category. In particular, CASS is modeled as the latent space of a deep generative model of canonical 3D shapes with normalized pose. We train a variational auto-encoder (VAE) for generating 3D point clouds in the canonical space from an RGBD image. The VAE is trained in a cross-category fashion, exploiting the publicly available large 3D shape repositories. Since the 3D point cloud is generated in normalized pose (with actual size), the encoder of the VAE learns a view-factorized RGBD embedding: it maps an RGBD image taken from an arbitrary view into a pose-independent 3D shape representation. Object pose is then estimated by contrasting this representation with a pose-dependent feature of the input RGBD image, extracted with a separate deep neural network. We integrate the learning of CASS and pose and size estimation into an end-to-end trainable network, achieving state-of-the-art performance.
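To make "normalized pose (with actual size)" concrete, the following is a minimal geometric sketch, not the network described in the abstract. It assumes a rigid pose (rotation `R`, translation `t`) is known and shows that undoing it maps two different views of the same object to one canonical point cloud without rescaling. The helper names (`rotation_z`, `to_camera`, `to_canonical`) and the random test shape are hypothetical and chosen only for illustration.

```python
import numpy as np

def rotation_z(angle_rad):
    """Rotation matrix for a rotation about the z-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def to_camera(points_canonical, R, t):
    """Place canonical points (N, 3) in the camera frame: p_cam = R p + t."""
    return points_canonical @ R.T + t

def to_canonical(points_camera, R, t):
    """Undo the pose: p = R^T (p_cam - t). No scaling, so size is preserved."""
    return (points_camera - t) @ R

# Hypothetical object: 100 random points within a 20 cm cube (units: metres).
rng = np.random.default_rng(0)
shape = rng.uniform(-0.1, 0.1, size=(100, 3))

# Two arbitrary views of the same object.
R_a, t_a = rotation_z(0.3), np.array([0.0, 0.1, 0.8])
R_b, t_b = rotation_z(-1.2), np.array([0.2, 0.0, 1.5])
view_a = to_camera(shape, R_a, t_a)
view_b = to_camera(shape, R_b, t_b)

# Removing each view's pose yields the same pose-independent point cloud.
canon_a = to_canonical(view_a, R_a, t_a)
canon_b = to_canonical(view_b, R_b, t_b)
```

In the approach described above the pose is not given; the encoder is trained so that its output already lives in this canonical frame, and the pose is recovered separately.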
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Category-level 6D Pose Estimation | REAL275 (test) | Pose Acc (5°/5cm) | 23.5 | 53 |
| Category-level 6D Object Pose Estimation | REAL275 | mAP (5°5cm) | 23.5 | 16 |
| Pose Estimation | NOCS (test) | mAP IoU 50 | 77.7 | 10 |
| Pose Estimation | NOCS REAL275 (test) | mAP (IoU=0.50) | 0.777 | 10 |
| Category-level 9D Pose Estimation | NOCS REAL275 (test) | mAP (5° 5cm) | 23.5 | 9 |
| 6D Pose Estimation | NOCS REAL275 | Accuracy (5°5cm) | 23.5 | 7 |
| 3D Object Detection | NOCS CAMERA25 | IoU@25 | 84.2 | 6 |
| 6D Pose Estimation | NOCS CAMERA25 | Success Rate (5°5cm) | 23.5 | 6 |
| 3D Object Detection | NOCS REAL275 | IoU@25% | 70.7 | 6 |
| Shape Reconstruction | NOCS | Shape Error (Bottle) | 0.75 | 5 |
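
Several rows above report a 5°5cm criterion, under which a predicted pose counts as correct when its rotation error is below 5 degrees and its translation error is below 5 cm. The sketch below shows that per-pose check only. It assumes translations are given in metres and uses hypothetical function names. It does not reproduce any benchmark's evaluation code; benchmark protocols typically also handle symmetric objects and aggregate results into mAP, both of which are omitted here.

```python
import numpy as np

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle between two rotation matrices, in degrees."""
    cos_angle = (np.trace(R_pred @ R_gt.T) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance between translations (assumed metres), in cm."""
    diff = np.asarray(t_pred, dtype=float) - np.asarray(t_gt, dtype=float)
    return float(np.linalg.norm(diff) * 100.0)

def within_5deg_5cm(R_pred, t_pred, R_gt, t_gt):
    """True when both the rotation and translation errors are under threshold."""
    return (rotation_error_deg(R_pred, R_gt) < 5.0
            and translation_error_cm(t_pred, t_gt) < 5.0)

def rotation_z_deg(angle_deg):
    """Hypothetical helper: rotation about the z-axis by an angle in degrees."""
    a = np.radians(angle_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

R_gt, t_gt = np.eye(3), np.array([0.0, 0.0, 1.0])

# 3 degrees and 2 cm off: inside both thresholds.
close = within_5deg_5cm(rotation_z_deg(3.0), t_gt + [0.02, 0.0, 0.0], R_gt, t_gt)

# 10 degrees off: fails the rotation threshold even with exact translation.
far = within_5deg_5cm(rotation_z_deg(10.0), t_gt, R_gt, t_gt)
```

An accuracy figure of this kind would then be the fraction (or, for mAP, the average precision) of predictions for which the check holds.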