
General surgery vision transformer: A video pre-trained foundation model for general surgery

About

The absence of openly accessible data and specialized foundation models is a major barrier to computational research in surgery. Toward addressing this, (i) we open-source the largest dataset of general surgery videos to date, consisting of 680 hours of surgical video, including data from robotic and laparoscopic techniques across 28 procedures; (ii) we propose a technique for video pre-training a general surgery vision transformer (GSViT) on surgical videos based on forward video prediction that can run in real time for surgical applications, and we open-source the code and weights of GSViT; (iii) we also release code and weights for procedure-specific fine-tuned versions of GSViT across 10 procedures; (iv) we demonstrate the performance of GSViT on the Cholec80 phase annotation task, showing improved performance over state-of-the-art single-frame predictors.
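The forward-video-prediction pre-training objective in (ii) can be sketched as follows. The synthetic frames, the linear next-frame predictor, and the training loop below are illustrative stand-ins only, not taken from the released GSViT code (the actual encoder is a vision transformer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": T flattened frames with temporal structure,
# frame[t+1] = 0.9 * frame[t] + noise.
T, D = 64, 32
frames = np.empty((T, D))
frames[0] = rng.normal(size=D)
for t in range(T - 1):
    frames[t + 1] = 0.9 * frames[t] + 0.1 * rng.normal(size=D)

# Illustrative stand-in for the encoder: a single linear map W trained to
# predict frame t+1 from frame t (the forward-video-prediction objective).
W = np.zeros((D, D))
x, y = frames[:-1], frames[1:]          # (frame_t, frame_{t+1}) pairs

def loss(W):
    return float(np.mean((x @ W - y) ** 2))

loss_before = loss(W)
lr = 0.05
for _ in range(500):
    grad = 2 * (x.T @ (x @ W - y)) / len(x)   # d(MSE)/dW
    W -= lr * grad
loss_after = loss(W)
```

Because the self-supervised target (the next frame) comes for free from the video itself, the same loop scales to unlabeled surgical footage; GSViT replaces the linear map with a transformer encoder.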

Samuel Schmidgall, Ji Woong Kim, Jeffrey Jopling, Axel Krieger • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surgical Phase Recognition | Cholec80 | Average F1 | 8.06 | 35 |
| Action Triplet Recognition | CholecT50 | AP (I) | 32.27 | 27 |
| Skill Assessment | Cholec80 CVS | mAP | 0.3126 | 26 |
| Closed-ended Visual Question Answering | LLS48-VQA | F1 Score | 5.68 | 26 |
| Open-ended Visual Question Answering | LLS48-VQA | BLEU-1 | 0.4167 | 26 |
| Semantic Segmentation | DSAD | DSC | 21.67 | 26 |
| Closed-ended Visual Question Answering | PitVQA | F1 Score | 25.74 | 26 |
| Depth Estimation | Hamlyn | Abs Rel | 0.2407 | 26 |
| Instance Segmentation | Grasp | mAP (Mask) | 0.3778 | 26 |
| Object Detection | Grasp | mAP (BBox) | 39.38 | 26 |

(Showing 10 of 35 rows)
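The "Average F1" metric reported for Cholec80 phase recognition is commonly a macro-averaged, frame-level F1 over surgical phases. A minimal sketch is below; the convention of averaging only over phases present in the ground truth is an assumption here, not taken from the benchmark's released evaluation code:

```python
import numpy as np

def macro_f1(y_true, y_pred, n_classes):
    """F1 per phase, averaged over phases present in the ground truth."""
    f1s = []
    for c in range(n_classes):
        if not np.any(y_true == c):
            continue                       # skip phases absent from this video
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))

# Toy frame-level phase labels for a 10-frame clip (Cholec80 defines 7 phases).
y_true = np.array([0, 0, 1, 1, 2, 2, 2, 3, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 2, 3, 3, 3, 3])
score = macro_f1(y_true, y_pred, n_classes=7)
```

Macro averaging weights every phase equally, so short phases (which dominate errors in phase recognition) are not drowned out by long ones.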
