Hi-End-MAE: Hierarchical encoder-driven masked autoencoders are stronger vision learners for medical image segmentation
About
Medical image segmentation remains a formidable challenge due to label scarcity. Pre-training Vision Transformers (ViTs) through masked image modeling (MIM) on large-scale unlabeled medical datasets is a promising solution, offering both computational efficiency and model generalization across downstream tasks. However, current ViT-based MIM pre-training frameworks predominantly emphasize local aggregation representations in the output layers and fail to exploit the rich representations across different ViT layers, which better capture the fine-grained semantic information needed for precise medical downstream tasks. To fill this gap, we present Hierarchical Encoder-driven MAE (Hi-End-MAE), a simple yet effective ViT-based pre-training solution built on two key innovations: (1) Encoder-driven reconstruction, which encourages the encoder to learn more informative features to guide the reconstruction of masked patches; and (2) Hierarchical dense decoding, which implements a hierarchical decoding structure to capture rich representations across different layers. We pre-train Hi-End-MAE on a large-scale dataset of 10K CT scans and evaluate its performance on seven public medical image segmentation benchmarks. Extensive experiments demonstrate that Hi-End-MAE achieves superior transfer learning capabilities across various downstream tasks, revealing the potential of ViTs in medical imaging applications. The code is available at: https://github.com/FengheTan9/Hi-End-MAE
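To make the hierarchical dense decoding idea concrete, here is a minimal NumPy sketch: masked-patch queries are decoded by cross-attending to the visible-patch features of every encoder layer, from deepest to shallowest. The shapes, the single attention head, the absence of learned projections, and the function names are illustrative simplifications, not the paper's actual implementation (see the repository for that).

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_attention(queries, keys_values, d):
    """Single-head scaled dot-product cross-attention (no learned projections)."""
    scores = queries @ keys_values.T / np.sqrt(d)
    # numerically stable softmax over the key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def hierarchical_dense_decode(encoder_layer_feats, mask_queries):
    """Decode masked-patch queries against every encoder layer's features.

    encoder_layer_feats: list of (n_visible, d) arrays, shallow -> deep.
    mask_queries: (n_masked, d) array of learnable mask-token queries.
    """
    x = mask_queries
    d = mask_queries.shape[-1]
    for feats in reversed(encoder_layer_feats):  # deep -> shallow
        x = x + cross_attention(x, feats, d)     # residual cross-attention
    return x

# Toy setup: 4 encoder layers, 49 visible and 147 masked patches, dim 32
n_layers, n_visible, n_masked, d = 4, 49, 147, 32
enc_feats = [rng.normal(size=(n_visible, d)) for _ in range(n_layers)]
queries = rng.normal(size=(n_masked, d))

recon = hierarchical_dense_decode(enc_feats, queries)
print(recon.shape)  # (147, 32): one decoded feature per masked patch
```

The key design point sketched here is that the decoder consumes features from *all* encoder layers rather than only the last one, so reconstruction gradients drive every layer to produce useful, fine-grained representations.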
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Segmentation | WORD | 1-shot Acc | 77.24 | 13 |
| Segmentation | AMOS | 1-shot Score | 63.23 | 13 |
| Segmentation | BTCV | 1-shot Score | 72.45 | 13 |
| Segmentation | BraTS 21 | Performance (1-shot) | 54.46 | 13 |
| Classification | CC-CCII (100% train ratio) | Accuracy | 92.59 | 10 |
| Classification | CC-CCII (10% train ratio) | Accuracy | 78.76 | 10 |
| Classification | CC-CCII (50% train ratio) | Accuracy | 88.15 | 10 |
| Classification | CC-CCII (average across ratios) | Accuracy | 86.5 | 10 |