Scaling up self-supervised learning for improved surgical foundation models

About

Foundation models have revolutionized computer vision by achieving vastly superior performance across diverse tasks through large-scale pretraining on extensive datasets. However, their application in surgical computer vision has been limited. This study addresses this gap by introducing SurgeNetXL, a novel surgical foundation model that sets a new benchmark in surgical computer vision. Trained on the largest reported surgical dataset to date, comprising over 4.7 million video frames, SurgeNetXL achieves consistent top-tier performance across six datasets spanning four surgical procedures and three tasks, including semantic segmentation, phase recognition, and critical view of safety (CVS) classification. Compared with the best-performing surgical foundation models, SurgeNetXL shows mean improvements of 2.4, 9.0, and 12.6 percent for semantic segmentation, phase recognition, and CVS classification, respectively. Additionally, SurgeNetXL outperforms the best-performing ImageNet-based variants by 14.4, 4.0, and 1.6 percent in the respective tasks. In addition to advancing model performance, this study provides key insights into scaling pretraining datasets, extending training durations, and optimizing model architectures specifically for surgical computer vision. These findings pave the way for improved generalizability and robustness in data-scarce scenarios, offering a comprehensive framework for future research in this domain. All models and a subset of the SurgeNetXL dataset, including over 2 million video frames, are publicly available at: https://github.com/TimJaspers0801/SurgeNet.

Tim J.M. Jaspers, Ronald L.P.D. de Jong, Yiping Li, Carolus H.J. Kusters, Franciscus H.A. Bakker, Romy C. van Jaarsveld, Gino M. Kuiper, Richard van Hillegersberg, Jelle P. Ruurda, Willem M. Brinkman, Josien P.W. Pluim, Peter H.N. de With, Marcel Breeuwer, Yasmina Al Khalil, Fons van der Sommen • 2025

Related benchmarks

Task	Dataset	Result
Surgical Phase Recognition	Cholec80	Accuracy 73.2	70
Surgical Phase Recognition	MultiBypass140	Phase-level Precision 0.7347	39
Surgical Phase Recognition	Autolaparo	Average F1 57	39
Surgical workflow recognition	M2CAI 2016	Accuracy 69.87	39
Depth Estimation	Hamlyn	Abs Rel 0.1715	31
Surgical Phase Recognition	Cholec80 (test)	Precision 79.22	28
Action Triplet Recognition	CholecT50	AP (I) 83.22	27
Monocular Depth Estimation	SCARED	Abs Rel 0.1329	27
Instance Segmentation	Grasp	mAP (Mask) 0.5596	26
Closed-ended Visual Question Answering	PitVQA	F1 Score 59.17	26

Showing 10 of 37 rows
