Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures

About

Recent advancements in surgical computer vision applications have been driven by vision-only models, which do not explicitly integrate the rich semantics of language into their design. These methods rely on manually annotated surgical videos to predict a fixed set of object categories, limiting their generalizability to unseen surgical procedures and downstream tasks. In this work, we put forward the idea that the surgical video lectures available through open surgical e-learning platforms can provide effective vision and language supervisory signals for multi-modal representation learning without relying on manual annotations. We address the surgery-specific linguistic challenges present in surgical video lectures by employing multiple complementary automatic speech recognition systems to generate text transcriptions. We then present a novel method, SurgVLP - Surgical Vision Language Pre-training, for multi-modal representation learning. Extensive experiments across diverse surgical procedures and tasks demonstrate that the multi-modal representations learned by SurgVLP exhibit strong transferability and adaptability in surgical video analysis. Furthermore, our zero-shot evaluations highlight SurgVLP's potential as a general-purpose foundation model for surgical workflow analysis, reducing the reliance on extensive manual annotations for downstream tasks, and facilitating adaptation methods such as few-shot learning to build a scalable and data-efficient solution for various downstream surgical applications. The [training code](https://github.com/CAMMA-public/PeskaVLP) and [weights](https://github.com/CAMMA-public/SurgVLP) are public.
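The abstract does not describe how the zero-shot evaluation is carried out. One common setup for vision-language models, assumed here rather than taken from the paper, is to embed a video frame and one text prompt per class into a shared space and pick the class whose prompt is most similar to the frame. The sketch below illustrates that idea with toy vectors. The function names, the three phase labels, and the embedding values are hypothetical placeholders and do not come from the SurgVLP code or weights.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def zero_shot_classify(frame_embedding, prompt_embeddings):
    """Return the label whose prompt embedding is closest to the frame.

    prompt_embeddings maps a class label to the embedding of its text
    prompt. No labeled frames are needed, only one prompt per class.
    """
    scores = {
        label: cosine_similarity(frame_embedding, embedding)
        for label, embedding in prompt_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores


# Toy 3-dimensional embeddings (made-up values). In practice these would
# come from a pre-trained text encoder and visual encoder.
prompt_embeddings = {
    "preparation": [1.0, 0.0, 0.0],
    "calot triangle dissection": [0.0, 1.0, 0.0],
    "gallbladder dissection": [0.0, 0.0, 1.0],
}
frame_embedding = [0.1, 0.9, 0.2]

label, scores = zero_shot_classify(frame_embedding, prompt_embeddings)
print(label)
```

Because the class set is defined only by the text prompts, the same routine can be pointed at a different procedure or task by swapping the prompt dictionary, which is the property the abstract refers to when it mentions reduced reliance on manual annotations.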

Kun Yuan, Vinkle Srivastav, Tong Yu, Joel L. Lavanchy, Jacques Marescaux, Pietro Mascagni, Nassir Navab, Nicolas Padoy • 2023

Related benchmarks

Task	Dataset	Result
Surgical Phase Recognition	Cholec80	Accuracy 63.5	70
Surgical workflow recognition	M2CAI 2016	Accuracy 75.85	39
Surgical Phase Recognition	AutoLaparo	Average F1 16.6	39
Phase Recognition	Cholec80 (test)	F1 Score 0.244	37
Phase Recognition	AutoLaparo (test)	F1 Score 16.6	30
Critical View of Safety recognition	EndoScapes-CVS201 (test)	mAP 51.4	27
Action Triplet Recognition	CholecT50	AP (I) 75.56	27
Phase Recognition	Cholec80	--	24
Semantic segmentation	CholecSeg8K (test)	mIoU 19.5	23
Phase Recognition	BernBypass70 (test)	Top-1 Accuracy 11.4	21

Showing 10 of 54 rows
