EndoMamba: An Efficient Foundation Model for Endoscopic Videos via Hierarchical Pre-training

About

Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their applications are hindered by (1) computational inefficiencies and (2) suboptimal performance caused by limited data for pre-training in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiencies, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advancements in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference on online video streams. Second, we propose a self-supervised hierarchical pre-training paradigm that enhances EndoMamba's representation learning using endoscopic videos while incorporating general video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatiotemporal structures and high-level alignment to transfer broader knowledge from a pretrained general-video foundation model. Extensive experiments on four downstream tasks (classification, segmentation, surgical phase recognition, and localization) demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code is available at https://github.com/TianCuteQY/EndoMamba.
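The factorization described above (bidirectional scanning over patch tokens within each frame, causal past-to-present scanning over frames) can be illustrated with a toy sketch. This is a minimal illustration using a simplified scalar recurrence h[t] = a*h[t-1] + b*x[t] in place of real Mamba blocks; the function names (`causal_scan`, `bidirectional_scan`, `endomamba_like`) and parameters are illustrative assumptions, not taken from the paper's code.

```python
def causal_scan(xs, a=0.5, b=1.0):
    """Past-to-present recurrence: output t depends only on xs[: t + 1]."""
    h, out = 0.0, []
    for x in xs:
        h = a * h + b * x
        out.append(h)
    return out

def bidirectional_scan(xs, a=0.5, b=1.0):
    """Within-frame (spatial) modeling: fuse forward and backward scans,
    so every token sees the whole frame."""
    fwd = causal_scan(xs, a, b)
    bwd = causal_scan(xs[::-1], a, b)[::-1]
    return [f + g for f, g in zip(fwd, bwd)]

def endomamba_like(frames):
    """frames: list of frames, each a list of patch-token scalars.
    Spatial stage: bidirectional over tokens inside each frame.
    Temporal stage: causal over pooled per-frame summaries, so streaming
    inference only needs to carry the previous hidden state forward."""
    spatial = [bidirectional_scan(f) for f in frames]
    summaries = [sum(s) / len(s) for s in spatial]  # mean-pooled frame token
    return causal_scan(summaries)

video = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
out = endomamba_like(video)
```

Because the temporal stage is strictly causal, the output for the first t frames is unchanged when later frames arrive, which is what makes online, real-time inference on a video stream possible without reprocessing the past.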

Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Dongdong Lei, Sebastien Ourselin, Hongbin Liu · 2025

Related benchmarks

Task                          | Dataset                    | Metric         | Result | Rank
Surgical Phase Recognition    | Cholec80                   | Top-1 Accuracy | 68.81  | 65
Surgical Workflow Recognition | M2CAI 2016                 | Accuracy       | 60     | 39
Action Triplet Recognition    | CholecT50                  | AP (I)         | 58.45  | 27
Surgical Phase Recognition    | Cholec80 (test)            | --             | --     | 16
Detection                     | KUMC (test)                | F1 Score       | 88.8   | 14
Segmentation                  | CVC-12k (test)             | Dice Score (%) | 84.5   | 14
Classification                | PolypDiag (test)           | F1 Score       | 94.5   | 14
Surgical Workflow Recognition | AutoLaparo                 | Accuracy       | 80.96  | 14
Action Recognition            | SurgicalActions160 (test)  | Accuracy       | 58.45  | 14
Surgical Workflow Recognition | OphNet                     | Accuracy       | 28.94  | 14

Showing 10 of 21 rows
