Articulation in Prime: Primitive-Based Articulated Object Understanding from a Single Casual Video
About
Retrieving the 3D kinematics of articulated objects from monocular video is a fundamental challenge in computer vision. Existing methods rely on complex video setups or cues such as long-term point tracking or wide-baseline matching, but are frequently brittle under severe occlusions, rapid camera ego-motion, or weak local features. Learning-based methods, meanwhile, struggle to generalize beyond their training categories. We propose a category-agnostic optimization framework that treats articulated object understanding as a primitive-fitting problem. Geometric primitives serve as a proxy representation that avoids the pitfalls of unstable point tracks; a novel mechanism organizes them into coherent parts constrained by revolute and prismatic joints. Our formulation jointly optimizes part segmentation and joint parameters, recovering complex kinematics from a single casually captured video. A visibility-aware procedure handles partial observations and occlusions inherent to real-world data. We also propose the AiP-synth and AiP-real benchmarks, featuring significant camera motion and heavy occlusions, and outperform existing methods. Project page: https://aartykov.github.io/Articulation-in-Prime/
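To make the two joint types named above concrete, here is a minimal sketch of how a revolute or prismatic joint moves a single 3D point. It is an illustration of the standard joint models only, not the authors' implementation: the function names (`revolute_transform`, `prismatic_transform`) and the parameterization (axis direction, axis origin, angle or displacement) are assumptions made for this example.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _normalize(v):
    n = math.sqrt(_dot(v, v))
    if n == 0.0:
        raise ValueError("axis direction must be non-zero")
    return tuple(x / n for x in v)

def revolute_transform(point, axis_dir, axis_origin, angle_rad):
    """Rotate `point` by `angle_rad` about the line through `axis_origin`
    with direction `axis_dir`, using Rodrigues' rotation formula."""
    k = _normalize(axis_dir)
    # Express the point relative to a point on the joint axis.
    p = tuple(x - o for x, o in zip(point, axis_origin))
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    k_cross_p = _cross(k, p)
    k_dot_p = _dot(k, p)
    rotated = tuple(
        p[i] * c + k_cross_p[i] * s + k[i] * k_dot_p * (1.0 - c)
        for i in range(3)
    )
    # Shift back into the original frame.
    return tuple(r + o for r, o in zip(rotated, axis_origin))

def prismatic_transform(point, axis_dir, displacement):
    """Translate `point` by `displacement` along the unit-normalized `axis_dir`."""
    k = _normalize(axis_dir)
    return tuple(x + displacement * ki for x, ki in zip(point, k))
```

A revolute joint therefore needs both an axis direction and a point on the axis, whereas a prismatic joint is fully described by its direction and a scalar displacement.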
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Articulated Object Modeling | AiP real | Axis Alignment Error (Deg) | 1.13 | 22 |
| Articulation Estimation | Video2Articulation Revolute S (test) | Axis Error (°) | 0.00e+0 | 4 |
| Articulation Estimation | Arti4D (test) | Axis Error | 3.6 | 4 |
| Articulation Estimation | Video2Articulation Prismatic S (test) | Axis Accuracy (°) | 0.00e+0 | 3 |
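
The table reports axis errors in degrees but does not define how they are computed. A common convention, assumed here and not confirmed by the source, is the angle between the predicted and ground-truth axis directions, treating an axis and its negation as equivalent. The sketch below follows that assumption; the function name `axis_angle_error_deg` is hypothetical.

```python
import math

def axis_angle_error_deg(pred_axis, gt_axis):
    """Angle in degrees between two 3D axis directions, ignoring sign.

    Assumption: an axis and its opposite describe the same joint axis,
    so the result lies in [0, 90].
    """
    dot = sum(p * g for p, g in zip(pred_axis, gt_axis))
    norm_p = math.sqrt(sum(p * p for p in pred_axis))
    norm_g = math.sqrt(sum(g * g for g in gt_axis))
    if norm_p == 0.0 or norm_g == 0.0:
        raise ValueError("axis directions must be non-zero")
    # abs() makes the metric sign-invariant; min() guards against
    # floating-point values slightly above 1 before acos.
    cos_angle = min(1.0, abs(dot) / (norm_p * norm_g))
    return math.degrees(math.acos(cos_angle))
```

Under this definition, a predicted axis that points exactly opposite to the ground truth still scores an error of zero, which suits joints whose positive direction is arbitrary.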