
EndoMamba: An Efficient Foundation Model for Endoscopic Videos via Hierarchical Pre-training

About

Endoscopic video-based tasks, such as visual navigation and surgical phase recognition, play a crucial role in minimally invasive surgeries by providing real-time assistance. While recent video foundation models have shown promise, their application is hindered by (1) computational inefficiency and (2) suboptimal performance caused by the limited pre-training data available in endoscopy. To address these issues, we present EndoMamba, a foundation model designed for real-time inference while learning generalized spatiotemporal representations. First, to mitigate computational inefficiency, we propose the EndoMamba backbone, optimized for real-time inference. Inspired by recent advances in state space models, EndoMamba integrates Bidirectional Mamba blocks for spatial modeling within individual frames and vanilla Mamba blocks for past-to-present reasoning across the temporal domain. This design enables both strong spatiotemporal modeling and efficient inference on online video streams. Second, we propose a self-supervised hierarchical pre-training paradigm to enhance EndoMamba's representation learning, using endoscopic videos while incorporating general-video domain knowledge. Specifically, our approach combines masked reconstruction with auxiliary supervision, leveraging low-level reconstruction to capture spatiotemporal structures and high-level alignment to transfer broader knowledge from a pretrained general-video foundation model. Extensive experiments on four downstream tasks (classification, segmentation, surgical phase recognition, and localization) demonstrate that EndoMamba outperforms existing foundation models and task-specific methods while maintaining real-time inference speed. The source code is available at https://github.com/TianCuteQY/EndoMamba.
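
To make the backbone design concrete, here is a minimal PyTorch sketch of the interleaved spatial/temporal Mamba layers described above, assuming the open-source `mamba_ssm` package (whose fused kernels expect CUDA tensors). The module and parameter names (`SpatialBiMamba`, `TemporalMamba`, `EndoMambaBackbone`, `dim`, `depth`) are illustrative placeholders, not the authors' implementation; see the linked repository for the real one.

```python
# Illustrative sketch only, not the authors' code: bidirectional Mamba over
# patch tokens within each frame, causal Mamba across frames.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # pip install mamba-ssm


class SpatialBiMamba(nn.Module):
    """Bidirectional Mamba over the patch tokens of a single frame."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = Mamba(d_model=dim)   # left-to-right scan over patches
        self.bwd = Mamba(d_model=dim)   # right-to-left scan over patches

    def forward(self, x):               # x: (B*T, N_patches, dim)
        h = self.norm(x)
        out = self.fwd(h) + self.bwd(h.flip(1)).flip(1)
        return x + out                  # residual connection


class TemporalMamba(nn.Module):
    """Vanilla (causal) Mamba along time: past-to-present reasoning only,
    which is what permits streaming inference on online video."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mamba = Mamba(d_model=dim)

    def forward(self, x):               # x: (B*N_patches, T, dim)
        return x + self.mamba(self.norm(x))


class EndoMambaBackbone(nn.Module):
    """Alternates within-frame (spatial) and across-frame (temporal) blocks."""
    def __init__(self, dim: int = 384, depth: int = 4):
        super().__init__()
        self.spatial = nn.ModuleList(SpatialBiMamba(dim) for _ in range(depth))
        self.temporal = nn.ModuleList(TemporalMamba(dim) for _ in range(depth))

    def forward(self, tokens):          # tokens: (B, T, N, dim) patch embeddings
        B, T, N, D = tokens.shape
        x = tokens
        for sp, tm in zip(self.spatial, self.temporal):
            x = sp(x.reshape(B * T, N, D)).reshape(B, T, N, D)   # within frame
            x = x.permute(0, 2, 1, 3).reshape(B * N, T, D)       # fold time
            x = tm(x).reshape(B, N, T, D).permute(0, 2, 1, 3)    # across frames
        return x                        # (B, T, N, dim) spatiotemporal features
```

Because the temporal blocks scan strictly past-to-present, a deployed model can carry the recurrent state across incoming frames rather than reprocessing the whole clip, which is what makes real-time streaming inference feasible.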
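
The hierarchical pre-training objective can likewise be sketched as a two-term loss: low-level masked reconstruction plus high-level alignment to a frozen general-video teacher. This is an assumption-laden illustration, not the paper's exact formulation: the helpers (`decoder`, `proj`, `teacher`), the MAE-style random masking, the mean-pooled cosine alignment, and the weight `lambda_align` are all hypothetical stand-ins.

```python
# Hedged sketch of a hierarchical pre-training loss: masked reconstruction
# (low level) plus feature alignment to a frozen video teacher (high level).
import torch
import torch.nn.functional as F


def hierarchical_pretrain_loss(student, decoder, proj, teacher, video_tokens,
                               pixel_targets, mask_ratio=0.75, lambda_align=1.0):
    B, T, N, D = video_tokens.shape

    # Randomly mask patch tokens (MAE-style) before encoding with the student.
    mask = torch.rand(B, T, N, device=video_tokens.device) < mask_ratio
    visible = video_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    feats = student(visible)                      # (B, T, N, D)

    # Low-level objective: reconstruct raw patches at the masked positions.
    pred_pixels = decoder(feats)                  # (B, T, N, patch_dim)
    recon = F.mse_loss(pred_pixels[mask], pixel_targets[mask])

    # High-level objective: align pooled student features with a frozen
    # general-video foundation model acting as teacher.
    with torch.no_grad():
        target = teacher(video_tokens)            # (B, feat_dim), unmasked input
    pooled = proj(feats.mean(dim=(1, 2)))         # (B, feat_dim)
    align = 1.0 - F.cosine_similarity(pooled, target, dim=-1).mean()

    return recon + lambda_align * align
```

The teacher runs under `torch.no_grad()`, so gradients flow only through the student encoder, the reconstruction decoder, and the projection head.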

Qingyao Tian, Huai Liao, Xinyan Huang, Bingyu Yang, Dongdong Lei, Sebastien Ourselin, Hongbin Liu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Surgical Phase Recognition | Cholec80 | Average F1 | 54.69 | 35 |
| Action Triplet Recognition | CholecT50 | AP (I) | 58.45 | 27 |
| Surgical workflow recognition | Autolaparo | Accuracy | 80.96 | 14 |
| Action Recognition | SurgicalActions160 (test) | Accuracy | 58.45 | 14 |
| Surgical workflow recognition | OphNet | Accuracy | 28.94 | 14 |
| Surgical workflow recognition | PMLR 50 | Accuracy | 75.5 | 14 |
| Surgical workflow recognition | EgoSurgery (test) | Accuracy | 53.3 | 14 |
| Surgical workflow recognition | M2CAI 2016 | Accuracy | 60 | 14 |
| Surgical workflow recognition | PitVis (test) | Accuracy | 61.54 | 14 |
| Surgical workflow recognition | Atlas-Neurosurgical (test) | Accuracy | 70.37 | 14 |
Showing 10 of 17 benchmark rows.
