STORM: Segment, Track, and Object Re-Localization from a Single Image
About
Accurate 6D pose estimation and tracking are core capabilities for physical AI systems, yet real-world deployment remains brittle and labor-intensive. Many pipelines rely on CAD models, manual masking, or per-object adaptation, and still fail under occlusion or fast motion without a principled way to recognize failure. We propose STORM, a unified framework for reference-conditioned 6D tracking that can operate from a single reference image, with minimal manual input and improved robustness. STORM combines: (i) Hierarchical Spatial Fusion Attention (HSFA), a task-driven reference-query fusion architecture that supports both single-reference and multi-reference conditioning and can optionally use vision-language semantic conditioning to resolve instance ambiguities; and (ii) a BCE-trained tracking verifier whose continuous compatibility logit is used as an energy-like score to detect drift and trigger automatic re-initialization. Experiments on LM-O and YCB-Video show that STORM improves annotation-free pose tracking accuracy over strong baselines and recovers reliably from severe occlusions and rapid viewpoint changes with minimal overhead.
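The verifier-driven recovery described in (ii) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that a continuous compatibility logit serves as an energy-like score for detecting drift and triggering re-initialization. The class name `DriftMonitor`, the `threshold` and `patience` parameters, and the rule of requiring several consecutive low-score frames are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DriftMonitor:
    """Hypothetical drift detector driven by a verifier's compatibility logit.

    Assumption: a higher logit means the tracked pose is more compatible with
    the reference, so -logit plays the role of an energy (lower is better).
    """

    threshold: float = 0.0  # assumed cut-off; a logit of 0 is a BCE probability of 0.5
    patience: int = 3       # assumed number of consecutive bad frames before reset
    _low_streak: int = field(default=0, init=False, repr=False)

    def update(self, logit: float) -> bool:
        """Consume one frame's logit; return True if re-initialization is due."""
        if logit < self.threshold:
            self._low_streak += 1
        else:
            self._low_streak = 0  # a confident frame clears the streak

        if self._low_streak >= self.patience:
            self._low_streak = 0  # start fresh after requesting a reset
            return True
        return False


if __name__ == "__main__":
    monitor = DriftMonitor(threshold=0.0, patience=2)
    for frame, logit in enumerate([2.0, -1.0, 1.5, -0.5, -2.0, 3.0]):
        if monitor.update(logit):
            print(f"frame {frame}: drift detected, re-initialize tracker")
```

Requiring a streak rather than reacting to a single low score is one possible design choice for tolerating brief occlusions without resetting; the abstract does not specify how STORM makes this decision.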
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| 6D Pose Estimation | YCB-Video | AUC (ADD-S) | 0.98 | 151 |
| Segmentation | BOP (test) | LM-O | 57.8 | 13 |
| 6D Pose Estimation | LineMOD-Occluded | ADD-AUC | 74 | 3 |