LEMON: A Large Endoscopic MONocular Dataset and Foundation Model for Perception in Surgical Settings
About
Traditional open-access datasets focusing on surgical procedures are often limited by their small size, typically consisting of fewer than 100 videos and less than 30 hours of footage, which leads to poor model generalization. To address this data limitation, a new dataset called LEMON has been compiled using a novel aggregation pipeline that collects high-resolution videos from online sources. Featuring an extensive collection of over 4K surgical videos totaling 938 hours (85 million frames) of high-quality footage across multiple procedure types, LEMON offers a comprehensive resource surpassing existing alternatives in size and scope, including two novel downstream tasks. To demonstrate the effectiveness of this diverse dataset, we introduce LemonFM, a foundation model pretrained on LEMON using a novel self-supervised augmented knowledge distillation approach. LemonFM consistently outperforms existing surgical foundation models across four downstream tasks and six datasets, achieving significant gains in surgical phase recognition (+9.5pp, +9.4pp, and +8.4pp in Jaccard on AutoLaparo, M2CAI16, and Cholec80), surgical action recognition (+4.4pp in mAP on CholecT50), surgical tool presence detection (+5.3pp and +10.2pp in mAP on Cholec80 and GraSP), and surgical semantic segmentation (+10.3pp in mDice on CholecSeg8k). LEMON and LemonFM will serve as foundational resources for the research community and industry, accelerating progress in developing autonomous robotic surgery systems and ultimately contributing to safer and more accessible surgical care worldwide. Dataset, code, and models are publicly available at https://github.com/visurg-ai/LEMON.
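The phase recognition gains above are reported in Jaccard and in percentage points (pp). As a rough illustration of what such a number measures, the following minimal sketch computes a frame-wise, class-averaged Jaccard index for one video and the pp difference between two models. The function names and the toy labels are hypothetical, and the exact evaluation protocol used for LemonFM (for example per-video averaging or relaxed phase boundaries) is not stated here, so this should be read as an assumption rather than the authors' evaluation code.

```python
from typing import List, Sequence


def per_class_jaccard(y_true: Sequence[int], y_pred: Sequence[int],
                      num_classes: int) -> List[float]:
    """Frame-wise Jaccard (intersection over union) for each phase class.

    Classes that appear in neither the labels nor the predictions are
    skipped so they do not distort the average (an assumption of this sketch).
    """
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        union = tp + fp + fn
        if union > 0:
            scores.append(tp / union)
    return scores


def mean_jaccard_percent(y_true: Sequence[int], y_pred: Sequence[int],
                         num_classes: int) -> float:
    """Class-averaged Jaccard expressed as a percentage."""
    scores = per_class_jaccard(y_true, y_pred, num_classes)
    return 100.0 * sum(scores) / len(scores)


# Toy example: six frames, three phases (hypothetical data).
labels = [0, 0, 1, 1, 2, 2]
baseline_pred = [0, 1, 1, 1, 2, 0]
improved_pred = [0, 0, 1, 1, 2, 2]

baseline = mean_jaccard_percent(labels, baseline_pred, num_classes=3)
improved = mean_jaccard_percent(labels, improved_pred, num_classes=3)

# A gain in percentage points is the plain difference of the two percentages.
gain_pp = improved - baseline
print(f"baseline={baseline:.1f}  improved={improved:.1f}  gain={gain_pp:+.1f}pp")
```

A gain of +9.5pp therefore means the absolute Jaccard percentage rose by 9.5, not that it improved by 9.5% relative to the baseline.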
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Surgical Phase Recognition | Cholec80 | Top-1 Accuracy | 92.7 | 65 |
| Surgical workflow recognition | M2CAI 2016 | Accuracy | 68.4 | 39 |
| Surgical Phase Recognition | AutoLaparo | Average F1 | 66.9 | 36 |
| Semantic segmentation | CholecSeg8k (test) | Dice Score | 81.3 | 20 |
| Surgical action recognition | CholecT50 | mAP | 61.9 | 15 |
| Surgical tool presence detection | Cholec80 | mAP | 93.7 | 15 |
| Instrument Presence Recognition | GraSP | mAP | 94.4 | 14 |
| Surgical Phase Recognition | M2CAI16 (test) | Accuracy | 89.9 | 10 |
| Surgical tool presence detection | GraSP | mAP | 76.4 | 7 |
| Binary video classification of surgery types | LEMON | Accuracy | 98.9 | 5 |