Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation
About
Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, accurate autonomous control is difficult to implement because end-effector proprioception is poor: joint encoder readings are typically inaccurate due to kinematic non-idealities in the cable-driven transmissions. Vision-based pose estimation approaches are highly effective, but often lack real-time capability or generalizability, or are hard to train. In this work, we present a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. On a real robot dataset, we demonstrate its ability to correct noisy pose estimates while remaining suitable for real-time processing. Our approach reduces hand-eye translation errors by more than 50%, matching the performance of an existing optimization-based method while running four times faster, with near real-time inference at 22 Hz. Zero-shot prediction on an unseen dataset shows good generalization, and the model can be further fine-tuned for increased performance without human labeling.
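To make the end-to-end differentiable training setup concrete, the sketch below shows one possible training step under heavy assumptions: a toy CNN encoder stands in for the Vision Transformer, a differentiable pinhole keypoint projection stands in for the differentiable renderer, and the names `PoseCorrectionNet`, `forward_kinematics`, and `project_points` are hypothetical illustrations rather than the paper's actual code or API.

```python
# Minimal sketch, assuming PyTorch. All names and shapes here are illustrative,
# not the paper's implementation.
import torch
import torch.nn as nn

class PoseCorrectionNet(nn.Module):
    """Image encoder that regresses a small pose correction [dx, dy, dz, droll, dpitch, dyaw]."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 6)

    def forward(self, image):
        return self.head(self.backbone(image))

def forward_kinematics(joint_angles, correction):
    # Toy differentiable FK: nominal end-effector keypoints offset by the predicted
    # translation correction. A real implementation would compose SE(3) transforms
    # along the cable-driven kinematic chain.
    nominal = torch.stack([joint_angles, joint_angles * 0.5,
                           torch.ones_like(joint_angles)], dim=-1)
    return nominal + correction[:, None, :3]

def project_points(points_3d, focal=500.0, cx=320.0, cy=240.0):
    # Differentiable pinhole projection, used as a stand-in for differentiable rendering.
    z = points_3d[..., 2:3].clamp(min=1e-3)
    return torch.cat([focal * points_3d[..., 0:1] / z + cx,
                      focal * points_3d[..., 1:2] / z + cy], dim=-1)

model = PoseCorrectionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One end-to-end step on dummy data: predict a correction from the image, push it
# through differentiable kinematics and projection, and supervise against observed
# 2D tool keypoints, so no human pose labels are needed.
image = torch.rand(1, 3, 224, 224)
joint_angles = torch.rand(1, 4)
observed_keypoints_2d = torch.rand(1, 4, 2) * 480.0

correction = model(image)
predicted_2d = project_points(forward_kinematics(joint_angles, correction))
loss = nn.functional.mse_loss(predicted_2d, observed_keypoints_2d)
loss.backward()
optimizer.step()
```

The design point this sketch illustrates is that every stage from image to reprojected tool geometry stays differentiable, so gradients from an image-space loss can reach the pose-correction network directly.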
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surgical tool pose tracking | PSM end-effector hand-eye transform (test) | Roll RMSE (deg) | 11.8 | 4 |
| Surgical Tool Tracking (Translation) | PSM end-effector hand-eye transform (test) | RMSE X (mm) | 2.49 | 4 |
| Robot Pose Correction | dVRK (test) | Latency (ms) | 41.39 | 4 |
| Surgical robot end-effector pose estimation | Novel unseen configuration (test) | RMSE X (mm) | 2.12 | 2 |