Pixel-Wise Recognition for Holistic Surgical Scene Understanding
About
This paper presents the Holistic and Multi-Granular Surgical Scene Understanding of Prostatectomies (GraSP) dataset, a curated benchmark that models surgical scene understanding as a hierarchy of complementary tasks with varying levels of granularity. Our approach encompasses long-term tasks, such as surgical phase and step recognition, and short-term tasks, including surgical instrument segmentation and atomic visual action detection. To exploit our proposed benchmark, we introduce the Transformers for Actions, Phases, Steps, and Instrument Segmentation (TAPIS) model, a general architecture that combines a global video feature extractor with localized region proposals from an instrument segmentation model to tackle the multi-granularity of our benchmark. Through extensive experimentation on our benchmark and on alternative benchmarks, we demonstrate TAPIS's versatility and state-of-the-art performance across different tasks. This work represents a foundational step forward in Endoscopic Vision, offering a novel framework for future research towards holistic surgical scene understanding.
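As a rough illustration of what combining a global video feature with localized region proposals can look like, the sketch below concatenates one clip-level vector with each region-level vector and scores the result with a single linear layer. This is not the TAPIS implementation: the function names, feature sizes, random features, and the choice of plain concatenation are all assumptions made for illustration only.

```python
import random

def fuse_global_and_regions(global_feat, region_feats):
    """Concatenate one clip-level vector with each region-level vector."""
    return [list(global_feat) + list(region) for region in region_feats]

def linear_head(fused, weights, bias):
    """Apply one linear layer: scores[i][k] = dot(fused[i], weights[k]) + bias[k]."""
    return [
        [sum(x * w for x, w in zip(row, w_k)) + b_k
         for w_k, b_k in zip(weights, bias)]
        for row in fused
    ]

# Hypothetical sizes; none of these values come from the paper.
GLOBAL_DIM, REGION_DIM, NUM_REGIONS, NUM_CLASSES = 8, 4, 3, 5
rng = random.Random(0)

# Stand-ins for the outputs of a video backbone and a segmentation model.
global_feat = [rng.gauss(0, 1) for _ in range(GLOBAL_DIM)]
region_feats = [[rng.gauss(0, 1) for _ in range(REGION_DIM)]
                for _ in range(NUM_REGIONS)]

# Randomly initialised classifier over the fused features.
weights = [[rng.gauss(0, 1) for _ in range(GLOBAL_DIM + REGION_DIM)]
           for _ in range(NUM_CLASSES)]
bias = [0.0] * NUM_CLASSES

fused = fuse_global_and_regions(global_feat, region_feats)
scores = linear_head(fused, weights, bias)  # one score vector per region
```

In this toy setup each region proposal receives its own class scores while sharing the same clip-level context, which is the general idea behind pairing short-term, region-level tasks with a global video representation.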
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Phase Recognition | GraSP (test) | mAP | 76.72 | 10 |
| Phase Recognition | MISAW | mAP | 97.14 | 10 |
| Instrument Semantic Segmentation | GraSP (cross-validation) | mIoU | 0.8705 | 8 |
| Surgical Phase Recognition | HeiChole | F1 Score | 0.7341 | 8 |
| Gesture Recognition | RARP-45 (test) | mAP | 57.25 | 6 |
| Instrument Instance Segmentation | GraSP (cross-validation) | mAP@0.5 (Box) | 92.65 | 6 |
| Instrument Presence Recognition | GraSP | mAP | 94.33 | 6 |
| Phase Recognition | MISAW (test) | mAP | 97.14 | 6 |
| Step Recognition | MISAW (test) | mAP | 77.52 | 6 |
| Atomic Action Detection | GraSP (test) | mAP@0.5 IoU (Box) | 39.26 | 4 |