Towards Holistic Surgical Scene Understanding
About
Most benchmarks for studying surgical interventions focus on a specific challenge instead of leveraging the intrinsic complementarity among different tasks. In this work, we present a new experimental framework towards holistic surgical scene understanding. First, we introduce the Phase, Step, Instrument, and Atomic Visual Action recognition (PSI-AVA) Dataset. PSI-AVA includes annotations for both long-term (Phase and Step recognition) and short-term reasoning (Instrument detection and novel Atomic Action recognition) in robot-assisted radical prostatectomy videos. Second, we present Transformers for Action, Phase, Instrument, and steps Recognition (TAPIR) as a strong baseline for surgical scene understanding. TAPIR leverages our dataset's multi-level annotations as it benefits from the learned representation on the instrument detection task to improve its classification capacity. Our experimental results in both PSI-AVA and other publicly available databases demonstrate the adequacy of our framework to spur future research on holistic surgical scene understanding.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Phase Recognition | GraSP (test) | mAP | 72.59 | 10 |
| Phase Recognition | MISAW | mAP | 94.24 | 10 |
| Atomic Action Detection | GraSP (test) | mAP@0.5 IoU (Box) | 25.57 | 4 |
| Instrument Segmentation | GraSP (test) | mAP@0.5 (Box) | 74.43 | 4 |
| Step Recognition | GraSP (test) | mAP | 50.24 | 4 |
| Atomic Action Recognition | PSI-AVA | mAP@0.5 | 28.68 | 3 |
| Instrument Recognition | PSI-AVA | mAP@0.5 IoU | 81.14 | 3 |
| Phase Recognition | PSI-AVA | mAP | 56.55 | 3 |
| Step Recognition | PSI-AVA | mAP | 45.56 | 3 |
| Step Recognition | MISAW | mAP (%) | 79.18 | 2 |
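
All results above are reported as mean average precision (mAP). As a rough illustration of what that number measures for the classification-style tasks (phase and step recognition), the sketch below computes a per-class average precision from ranked scores and averages it over classes. This is a minimal, generic formulation and not the benchmarks' official evaluation code: the function names, the toy scores, and the choice of non-interpolated AP are assumptions, and the detection-style rows (mAP@0.5 IoU) additionally require matching predicted boxes to ground truth, which is not shown here.

```python
def average_precision(scores, labels):
    """Non-interpolated AP for one class.

    scores: confidence per sample for this class (hypothetical toy values).
    labels: 1 if the sample truly belongs to the class, else 0.
    """
    # Rank samples from most to least confident.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            # Precision measured at the rank of each true positive.
            precisions.append(hits / rank)
    # A class with no positives contributes 0.0 in this sketch (an assumption).
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(scores_by_class, labels_by_class):
    """Unweighted mean of per-class AP values."""
    aps = [
        average_precision(scores_by_class[name], labels_by_class[name])
        for name in scores_by_class
    ]
    return sum(aps) / len(aps)


# Toy example with two hypothetical classes and four frames.
toy_scores = {"phase_a": [0.9, 0.8, 0.7, 0.6], "phase_b": [0.2, 0.9, 0.4, 0.1]}
toy_labels = {"phase_a": [1, 0, 1, 0], "phase_b": [0, 1, 0, 0]}
toy_map = mean_average_precision(toy_scores, toy_labels)
```

Multiplying the returned value by 100 gives a percentage comparable in scale to the table entries.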