Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation
About
Medical image segmentation remains a formidable challenge due to label scarcity. Pre-training Vision Transformers (ViTs) through masked image modeling (MIM) on large-scale unlabeled medical datasets is a promising solution, offering both computational efficiency and model generalization across downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in the output layers and fail to exploit the rich representations across different ViT layers, which better capture the fine-grained semantic information needed for precise medical downstream tasks. To fill this gap, we present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution built on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluate its performance on seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViTs in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
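To make the hierarchical dense decoding idea concrete, here is a minimal NumPy sketch: masked-patch queries are decoded by cross-attending to the visible-patch features of every encoder layer, from deepest to shallowest. The shapes, the single attention head, the absence of learned projections, and the function names are illustrative simplifications, not the paper's actual implementation (see the repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys_values, d):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scores = queries @ keys_values.T / np.sqrt(d)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def hierarchical_dense_decode(encoder_layer_feats, mask_queries):
    """Decode masked-patch queries against every encoder layer's features.

    encoder_layer_feats: list of (n_visible, d) arrays, shallow -> deep.
    mask_queries: (n_masked, d) array of learnable mask-token queries.
    """
    x = mask_queries
    d = mask_queries.shape[-1]
    for feats in reversed(encoder_layer_feats):  # deep -> shallow
        x = x + cross_attention(x, feats, d)     # residual cross-attention
    return x

# Toy setup: 4 encoder layers, 49 visible and 147 masked patches, dim 32
n_layers, n_visible, n_masked, d = 4, 49, 147, 32
enc_feats = [rng.normal(size=(n_visible, d)) for _ in range(n_layers)]
queries = rng.normal(size=(n_masked, d))

recon = hierarchical_dense_decode(enc_feats, queries)
print(recon.shape)  # (147, 32): one decoded feature per masked patch
```

The key design point sketched here is that the decoder consumes features from *all* encoder layers rather than only the last one, so reconstruction gradients drive every layer to produce useful, fine-grained representations.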
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Segmentation | WORD | 1-shot Acc | 77.24 | 13 |
| Segmentation | AMOS | 1-shot Score | 63.23 | 13 |
| Segmentation | BTCV | 1-shot Score | 72.45 | 13 |
| Segmentation | BraTS 21 | Performance (1-shot) | 54.46 | 13 |
| Classification | CC-CCII (100% train ratio) | Accuracy | 92.59 | 10 |
| Classification | CC-CCII (10% train ratio) | Accuracy | 78.76 | 10 |
| Classification | CC-CCII (50% train ratio) | Accuracy | 88.15 | 10 |
| Classification | CC-CCII (average across ratios) | Accuracy | 86.5 | 10 |