Real-time Capable Learning-based Visual Tool Pose Correction via Differentiable Simulation
About
Autonomy in robot-assisted minimally invasive surgery has the potential to reduce surgeon cognitive and task load, thereby increasing procedural efficiency. However, accurate autonomous control is difficult to implement because end-effector proprioception is poor: joint encoder readings are typically inaccurate due to kinematic non-idealities in the cable-driven transmissions. Vision-based pose estimation approaches are highly effective, but often lack real-time capability or generalizability, or are hard to train. In this work, we present a real-time capable, Vision Transformer-based pose estimation approach that is trained using end-to-end differentiable kinematics and rendering. On a real robot dataset, we demonstrate its ability to correct noisy pose estimates while remaining suitable for real-time processing. Our approach reduces hand-eye translation errors by more than 50%, matching the performance of an existing optimization-based method while running four times faster, with near real-time inference at 22 Hz. Zero-shot prediction on an unseen dataset shows good generalization, and the model can be further fine-tuned for increased performance without human labeling.
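To make the end-to-end differentiable training setup concrete, the sketch below shows one possible training step under heavy assumptions: a toy CNN encoder stands in for the Vision Transformer, a differentiable pinhole keypoint projection stands in for the differentiable renderer, and the names `PoseCorrectionNet`, `forward_kinematics`, and `project_points` are hypothetical illustrations rather than the paper's actual code or API.

```python
# Minimal sketch, assuming PyTorch. All names and shapes here are illustrative,
# not the paper's implementation.
import torch
import torch.nn as nn

class PoseCorrectionNet(nn.Module):
    """Image encoder that regresses a small pose correction [dx, dy, dz, droll, dpitch, dyaw]."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, 7, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, 6)

    def forward(self, image):
        return self.head(self.backbone(image))

def forward_kinematics(joint_angles, correction):
    # Toy differentiable FK: nominal end-effector keypoints offset by the predicted
    # translation correction. A real implementation would compose SE(3) transforms
    # along the cable-driven kinematic chain.
    nominal = torch.stack([joint_angles, joint_angles * 0.5,
                           torch.ones_like(joint_angles)], dim=-1)
    return nominal + correction[:, None, :3]

def project_points(points_3d, focal=500.0, cx=320.0, cy=240.0):
    # Differentiable pinhole projection, used as a stand-in for differentiable rendering.
    z = points_3d[..., 2:3].clamp(min=1e-3)
    return torch.cat([focal * points_3d[..., 0:1] / z + cx,
                      focal * points_3d[..., 1:2] / z + cy], dim=-1)

model = PoseCorrectionNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One end-to-end step on dummy data: predict a correction from the image, push it
# through differentiable kinematics and projection, and supervise against observed
# 2D tool keypoints, so no human pose labels are needed.
image = torch.rand(1, 3, 224, 224)
joint_angles = torch.rand(1, 4)
observed_keypoints_2d = torch.rand(1, 4, 2) * 480.0

correction = model(image)
predicted_2d = project_points(forward_kinematics(joint_angles, correction))
loss = nn.functional.mse_loss(predicted_2d, observed_keypoints_2d)
loss.backward()
optimizer.step()
```

The design point this sketch illustrates is that every stage from image to reprojected tool geometry stays differentiable, so gradients from an image-space loss can reach the pose-correction network directly.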
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surgical tool pose tracking | PSM end-effector hand-eye transform (test) | Roll RMSE (deg) | 11.8 | 4 |
| Surgical Tool Tracking (Translation) | PSM end-effector hand-eye transform (test) | RMSE X (mm) | 2.49 | 4 |
| Robot Pose Correction | dVRK (test) | Latency (ms) | 41.39 | 4 |
| Surgical robot end-effector pose estimation | Novel unseen configuration (test) | RMSE X (mm) | 2.12 | 2 |