Revisiting Automatic Data Curation for Vision Foundation Models in Digital Pathology

About

Vision foundation models (FMs) are accelerating the development of digital pathology algorithms and transforming biomedical research. These models learn, in a self-supervised manner, to represent histological features in highly heterogeneous tiles extracted from whole-slide images (WSIs) of real-world patient samples. The performance of these FMs is significantly influenced by the size, diversity, and balance of the pre-training data. However, data selection has been primarily guided by expert knowledge at the WSI level, focusing on factors such as disease classification and tissue types, while largely overlooking the granular details available at the tile level. In this paper, we investigate the potential of unsupervised automatic data curation at the tile level, taking into account 350 million tiles. Specifically, we apply hierarchical clustering trees to pre-extracted tile embeddings, allowing us to sample balanced datasets uniformly across the embedding space of the pre-trained FM. We further find that these datasets are subject to a trade-off between size and balance, which can compromise the quality of the representations learned by FMs, and we propose tailored batch sampling strategies to mitigate this effect. We demonstrate the effectiveness of our method through improved performance on a diverse range of clinically relevant downstream tasks.
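
The idea of sampling a balanced subset from clustered tile embeddings can be illustrated with a small sketch. This is not the authors' implementation: it assumes a two-level k-means as a stand-in for the hierarchical clustering trees, uses random vectors in place of real tile embeddings, and all function names (`kmeans`, `build_two_level_tree`, `balanced_sample`) and parameters are hypothetical.

```python
import numpy as np


def kmeans(x, k, rng, iters=20):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids, labels


def build_two_level_tree(embeddings, k_fine, k_coarse, rng):
    """Cluster tiles into fine clusters, then cluster the fine centroids."""
    fine_centroids, fine_labels = kmeans(embeddings, k_fine, rng)
    _, coarse_of_fine = kmeans(fine_centroids, k_coarse, rng)
    return fine_labels, coarse_of_fine


def balanced_sample(fine_labels, coarse_of_fine, budget, rng):
    """Split the budget evenly over coarse clusters, then over their fine clusters."""
    chosen = []
    coarse_ids = np.unique(coarse_of_fine)
    per_coarse = budget // len(coarse_ids)
    for c in coarse_ids:
        # Only fine clusters that actually contain tiles can be sampled from.
        fine_ids = [f for f in np.flatnonzero(coarse_of_fine == c)
                    if np.any(fine_labels == f)]
        if not fine_ids:
            continue
        per_fine = per_coarse // len(fine_ids)
        for f in fine_ids:
            members = np.flatnonzero(fine_labels == f)
            # A small cluster caps how many tiles it can contribute.
            take = min(per_fine, len(members))
            if take > 0:
                chosen.extend(rng.choice(members, size=take, replace=False))
    return np.array(sorted(chosen), dtype=int)


# Toy, imbalanced "embeddings": one large blob and one small blob (assumed data).
demo_rng = np.random.default_rng(0)
demo_embeddings = np.concatenate([
    demo_rng.normal(0.0, 1.0, size=(950, 16)),
    demo_rng.normal(10.0, 1.0, size=(50, 16)),
])
demo_fine, demo_coarse = build_two_level_tree(
    demo_embeddings, k_fine=8, k_coarse=2, rng=demo_rng)
demo_subset = balanced_sample(demo_fine, demo_coarse, budget=200, rng=demo_rng)
print(f"selected {len(demo_subset)} of {len(demo_embeddings)} tiles")
```

In this sketch a sparsely populated cluster cannot fill its quota, so the selected subset may come out smaller than the requested budget. That is one simple way to see how balance and dataset size can pull against each other, although the trade-off studied in the paper may take a different form.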

Boqi Chen, Cédric Vincent-Cuaz, Lydia A. Schoenpflug, Manuel Madeira, Lisa Fournier, Vaishnavi Subramanian, Sonali Andani, Samuel Ruiperez-Campillo, Julia E. Vogt, Raphaëlle Luisier, Dorina Thanou, Viktor H. Koelzer, Pascal Frossard, Gabriele Campanella, Gunnar Rätsch • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Biomarker Prediction	MSHS	ER	96.6	5
Biomarker Prediction	MSKCC	HRD Score	79.1	5
Detection	IBD	IBD Score	97	5
RoI-level classification	BRACS	bACC	69.3	5
RoI-level classification	BreakHis	bACC	98.3	5
RoI-level classification	BACH	Balanced Accuracy	91.2	5
RoI-level classification	CRC	Balanced Accuracy	91.9	5
RoI-level classification	UniToPatho	bACC	43.7	5
RoI-level classification	Chaoyang	bACC	78.8	5
RoI-level classification	Overall average across datasets	bACC	82	5

Showing 10 of 14 rows
