
General surgery vision transformer: A video pre-trained foundation model for general surgery

About

The absence of openly accessible data and specialized foundation models is a major barrier to computational research in surgery. Toward addressing this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical video, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, showing improved performance over state-of-the-art single-frame predictors.
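The forward-video-prediction pre-training objective in (ii) can be sketched as follows. The synthetic frames, the linear next-frame predictor, and the training loop below are illustrative stand-ins only, not taken from the released GSViT code (the actual encoder is a vision transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": T flattened frames with temporal structure,
# frame[t+1] = 0.9 * frame[t] + noise.
T, D = 64, 32
frames = np.empty((T, D))
frames[0] = rng.normal(size=D)
for t in range(T - 1):
    frames[t + 1] = 0.9 * frames[t] + 0.1 * rng.normal(size=D)

# Illustrative stand-in for the encoder: a single linear map W trained to
# predict frame t+1 from frame t (the forward-video-prediction objective).
W = np.zeros((D, D))
x, y = frames[:-1], frames[1:]          # (frame_t, frame_{t+1}) pairs

def loss(W):
    return float(np.mean((x @ W - y) ** 2))

loss_before = loss(W)
lr = 0.05
for _ in range(500):
    grad = 2 * (x.T @ (x @ W - y)) / len(x)   # d(MSE)/dW
    W -= lr * grad
loss_after = loss(W)
```

Because the self-supervised target (the next frame) comes for free from the video itself, the same loop scales to unlabeled surgical footage; GSViT replaces the linear map with a transformer encoder.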

Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, Axel Krieger • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surgical Phase Recognition | Cholec80 | Average F1 | 8.06 | 35 |
| Action Triplet Recognition | CholecT50 | AP (I) | 32.27 | 27 |
| Skill Assessment | Cholec80 CVS | mAP | 0.3126 | 26 |
| Closed-ended Visual Question Answering | LLS48-VQA | F1 Score | 5.68 | 26 |
| Open-ended Visual Question Answering | LLS48-VQA | BLEU-1 | 0.4167 | 26 |
| Semantic Segmentation | DSAD | DSC | 21.67 | 26 |
| Closed-ended Visual Question Answering | PitVQA | F1 Score | 25.74 | 26 |
| Depth Estimation | Hamlyn | Abs Rel | 0.2407 | 26 |
| Instance Segmentation | Grasp | mAP (Mask) | 0.3778 | 26 |
| Object Detection | Grasp | mAP (BBox) | 39.38 | 26 |

(Showing 10 of 35 rows)
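The "Average F1" metric reported for Cholec80 phase recognition is commonly a macro-averaged, frame-level F1 over surgical phases. A minimal sketch is below; the convention of averaging only over phases present in the ground truth is an assumption here, not taken from the benchmark's released evaluation code:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """F1 per phase, averaged over phases present in the ground truth."""
    f1s = []
    for c in range(n_classes):
        if not np.any(y_true == c):
            continue                       # skip phases absent from this video
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

# Toy frame-level phase labels for a 10-frame clip (Cholec80 defines 7 phases).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 2, 3, 3, 3, 3])
score = macro_f1(y_true, y_pred, n_classes=7)
```

Macro averaging weights every phase equally, so short phases (which dominate errors in phase recognition) are not drowned out by long ones.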
