OMNI-PoseX: A Fast Vision Model for 6D Object Pose Estimation in Embodied Tasks

About

Accurate 6D object pose estimation is a fundamental capability for embodied agents, yet it remains highly challenging in open-world environments. Many existing methods rely on closed-set assumptions or geometry-agnostic regression schemes, limiting their generalization, stability, and real-time applicability in robotic systems. We present OMNI-PoseX, a vision foundation model that introduces a novel network architecture unifying open-vocabulary perception with an SO(3)-aware reflected flow matching pose predictor. The architecture decouples object-level understanding from geometry-consistent rotation inference, and employs a lightweight multi-modal fusion strategy that conditions rotation-sensitive geometric features on compact semantic embeddings, enabling efficient and stable 6D pose estimation. To enhance robustness and generalization, the model is trained on large-scale 6D pose datasets, leveraging broad object diversity, viewpoint variation, and scene complexity to build a scalable open-world pose backbone. Comprehensive evaluations across pose estimation benchmarks, ablation studies, zero-shot generalization, and system-level robotic grasping integration demonstrate the effectiveness of OMNI-PoseX. OMNI-PoseX achieves state-of-the-art pose accuracy and real-time efficiency, while delivering geometrically consistent predictions that enable reliable grasping of diverse, previously unseen objects.

Michael Zhang, Wei Ying, Fangwen Chen, Shifeng Bai, Hanwen Kang • 2026

Related benchmarks

Task	Dataset	Metric	Result	Rank
6D Pose Estimation	OMNI6DPOSE (test)	Success Rate (5° 2cm)	11.57	15
6D Pose Estimation	Isaac Sim Embodied Tasks (unseen)	AUC @25	35.6	5
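
For readers unfamiliar with the 5° 2cm success rate listed above, the sketch below shows one common way such a metric is computed: a prediction counts as a success when its rotation error is at most 5 degrees and its translation error is at most 2 cm. This is an illustrative sketch, not the benchmark's official evaluation code. The function names are hypothetical, translations are assumed to be in metres, and object symmetries (which benchmarks usually handle separately) are ignored.

```python
import numpy as np

def rot_z(deg):
    """Helper for the example: rotation about the z-axis by `deg` degrees."""
    a = np.radians(deg)
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

def rotation_error_deg(R_pred, R_gt):
    """Geodesic angle (degrees) between two 3x3 rotation matrices."""
    cos_theta = (np.trace(R_pred.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] from round-off.
    return np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0)))

def translation_error_cm(t_pred, t_gt):
    """Euclidean distance in centimetres (inputs assumed to be in metres)."""
    return 100.0 * np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt))

def success_rate(preds, gts, deg_thresh=5.0, cm_thresh=2.0):
    """Percentage of (R, t) predictions within both thresholds."""
    hits = [
        rotation_error_deg(Rp, Rg) <= deg_thresh
        and translation_error_cm(tp, tg) <= cm_thresh
        for (Rp, tp), (Rg, tg) in zip(preds, gts)
    ]
    return 100.0 * float(np.mean(hits))

# Two toy predictions against an identity ground-truth pose:
# the first is off by 3 degrees and 1 cm, the second by 10 degrees.
preds = [(rot_z(3.0), np.array([0.01, 0.0, 0.0])),
         (rot_z(10.0), np.zeros(3))]
gts = [(np.eye(3), np.zeros(3))] * 2
print(success_rate(preds, gts))
```

In this toy case only the first prediction satisfies both thresholds, so half of the predictions count as successes.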
