SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos

About

While foundation models have advanced surgical video analysis, current approaches rely predominantly on pixel-level reconstruction objectives that waste model capacity on low-level visual details, such as smoke, specular reflections, and fluid motion, rather than on the semantic structures essential for surgical understanding. We present SurgMotion, a video-native foundation model that shifts the learning paradigm from pixel-level reconstruction to latent motion prediction. Built on the Video Joint Embedding Predictive Architecture (V-JEPA), SurgMotion introduces three key technical innovations tailored to surgical videos: (1) motion-guided latent masked prediction to prioritize semantically meaningful regions, (2) spatiotemporal affinity self-distillation to enforce relational consistency, and (3) spatiotemporal feature diversity regularization (SFDR) to prevent representation collapse in texture-sparse surgical scenes. To enable large-scale pretraining, we curate SurgMotion-15M, the largest surgical video dataset to date, comprising 3,658 hours of video from 50 sources across 13 anatomical regions. Extensive experiments across 17 benchmarks demonstrate that SurgMotion significantly outperforms state-of-the-art methods on surgical workflow recognition, with F1 score improvements of 14.6 percent on EgoSurgery and 10.3 percent on PitVis; on action triplet recognition, with 39.54 percent mAP-IVT on CholecT50; and on skill assessment, polyp segmentation, and depth estimation. These results establish SurgMotion as a new standard for universal, motion-oriented surgical video understanding.
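The abstract does not say how motion guides the masking, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that motion is approximated by the mean absolute difference between two consecutive grayscale frames, and that the patches with the highest motion scores are the ones selected for masking. The names `patch_motion`, `motion_guided_mask`, `patch`, and `mask_ratio` are hypothetical.

```python
from typing import List, Tuple

Frame = List[List[float]]  # grayscale frame as rows x cols of intensities


def patch_motion(prev: Frame, curr: Frame, patch: int) -> List[List[float]]:
    """Mean absolute frame difference for each non-overlapping patch.

    Assumption: frame differencing stands in for whatever motion cue
    the actual model uses; the source does not specify it.
    """
    rows, cols = len(curr), len(curr[0])
    grid = []
    for r0 in range(0, rows - patch + 1, patch):
        row_scores = []
        for c0 in range(0, cols - patch + 1, patch):
            total = 0.0
            for r in range(r0, r0 + patch):
                for c in range(c0, c0 + patch):
                    total += abs(curr[r][c] - prev[r][c])
            row_scores.append(total / (patch * patch))
        grid.append(row_scores)
    return grid


def motion_guided_mask(
    prev: Frame, curr: Frame, patch: int, mask_ratio: float
) -> List[Tuple[int, int]]:
    """Return (row, col) indices of the patches with the most motion.

    Assumption: a deterministic top-k selection; a real system might
    instead sample patches with probability tied to their motion score.
    """
    scores = patch_motion(prev, curr, patch)
    flat = [
        (score, (i, j))
        for i, row in enumerate(scores)
        for j, score in enumerate(row)
    ]
    k = max(1, round(mask_ratio * len(flat)))
    # Stable sort, so patches with equal scores keep row-major order.
    flat.sort(key=lambda item: -item[0])
    return sorted(pos for _, pos in flat[:k])
```

Calling `motion_guided_mask(prev, curr, patch=2, mask_ratio=0.25)` on two 4x4 frames returns the index of the single 2x2 patch whose pixels changed the most. In a masked-prediction setup, those patches would be hidden from the encoder and predicted in latent space.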

Jinlin Wu, Felix Holm, Chuxi Chen, An Wang, Yaxin Hu, Xiaofan Ye, Zelin Zang, Miao Xu, Lihua Zhou, Huai Liao, Danny T. M. Chan, Ming Feng, Wai S. Poon, Hongliang Ren, Dong Yi, Nassir Navab, Gaofeng Meng, Jiebo Luo, Hongbin Liu, Zhen Lei • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Surgical Phase Recognition | Cholec80 | Top-1 Accuracy | 91.05 | 65
Surgical workflow recognition | M2CAI 2016 | Accuracy | 89.45 | 39
Action Triplet Recognition | CholecT50 | AP (I) | 91.55 | 27
Action Quality Assessment | JIGSAWS | - | - | 20
Action Recognition | SurgicalActions160 (test) | Accuracy | 75.63 | 14
Action Recognition | PolypDiag (test) | Accuracy | 98.81 | 14
Depth Estimation | C3VD | RMSE | 1.88 | 14
Surgical workflow recognition | OphNet | Accuracy | 73.04 | 14
Surgical workflow recognition | PMLR 50 | Accuracy | 91.91 | 14
Surgical workflow recognition | Autolaparo | Accuracy | 86.37 | 14
Showing 10 of 19 rows
