
General surgery vision transformer: A video pre-trained foundation model for general surgery

About

The absence of openly accessible data and specialized foundation models is a major barrier to computational research in surgery. To address this, (i) we open-source the largest dataset of general surgery videos to date: 680 hours of surgical video, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, showing improved performance over state-of-the-art single-frame predictors.
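The pre-training objective named above, forward video prediction (predict frame t+1 from frame t), can be sketched in miniature. GSViT itself is a vision transformer trained on 680 hours of surgical video; the toy linear predictor, frame sizes, loss, and learning rate below are purely illustrative assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": 16 frames of 8x8 grayscale, flattened to 64-dim vectors.
frames = rng.random((16, 64)).astype(np.float32)

# Toy predictor: a single linear map W from frame t to frame t+1
# (stand-in for the transformer; an assumption for illustration only).
W = rng.normal(scale=0.01, size=(64, 64)).astype(np.float32)

def forward_prediction_loss(W, frames):
    """Mean squared error between predicted and actual next frames."""
    preds = frames[:-1] @ W      # predict frame t+1 from frame t
    targets = frames[1:]
    return float(np.mean((preds - targets) ** 2))

# One gradient-descent step on the MSE objective
# (closed-form gradient of a linear least-squares loss).
x, y = frames[:-1], frames[1:]
grad = 2.0 * x.T @ (x @ W - y) / (x.shape[0] * x.shape[1])
W_new = W - 0.1 * grad

loss_before = forward_prediction_loss(W, frames)
loss_after = forward_prediction_loss(W_new, frames)
print(loss_before, "->", loss_after)  # the step reduces the prediction error
```

The same self-supervised recipe scales up in the obvious way: replace the linear map with the transformer, the toy frames with video clips, and the closed-form gradient with autodiff.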

Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, Axel Krieger• 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surgical Phase Recognition | Cholec80 | Top-1 Accuracy | 75.3 | 65 |
| Surgical Phase Recognition | MultiBypass140 | Phase-level Precision | 0.5015 | 39 |
| Surgical Workflow Recognition | M2CAI 2016 | Accuracy | 36.02 | 39 |
| Surgical Phase Recognition | AutoLaparo | Average F1 | 17.5 | 36 |
| Monocular Depth Estimation | SCARED | Abs Rel | 0.2246 | 27 |
| Action Triplet Recognition | CholecT50 | AP (I) | 32.27 | 27 |
| Skill Assessment | Cholec80 CVS | mAP | 0.3126 | 26 |
| Closed-ended Visual Question Answering | LLS48-VQA | F1 Score | 5.68 | 26 |
| Open-ended Visual Question Answering | LLS48-VQA | BLEU-1 | 0.4167 | 26 |
| Semantic Segmentation | DSAD | DSC | 21.67 | 26 |
Showing 10 of 42 benchmark rows.
