Multi-scale Hierarchical Vision Transformer with Cascaded Attention Decoding for Medical Image Segmentation
About
Transformers have shown great success in medical image segmentation. However, transformers may exhibit a limited generalization ability due to the underlying single-scale self-attention (SA) mechanism. In this paper, we address this issue by introducing a Multi-scale hiERarchical vIsion Transformer (MERIT) backbone network, which improves the generalizability of the model by computing SA at multiple scales. We also incorporate an attention-based decoder, namely Cascaded Attention Decoding (CASCADE), for further refinement of multi-stage features generated by MERIT. Finally, we introduce an effective multi-stage feature mixing loss aggregation (MUTATION) method for better model training via implicit ensembling. Our experiments on two widely used medical image segmentation benchmarks (i.e., Synapse Multi-organ, ACDC) demonstrate the superior performance of MERIT over state-of-the-art methods. Our MERIT architecture and MUTATION loss aggregation can be used with downstream medical image and semantic segmentation tasks.
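The abstract does not spell out how MUTATION mixes the multi-stage outputs. One plausible reading, assumed here, is that the prediction maps of every non-empty subset of decoder stages are summed and the loss on each mixed map is added to the total, which gives 2^n - 1 loss terms for n stages. The sketch below illustrates that reading in plain Python; the names `mixed_loss` and `bce`, the use of binary cross-entropy, and the flat-list representation of prediction maps are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations
import math


def bce(logits, targets):
    """Mean binary cross-entropy from raw logits (flat lists of floats)."""
    total = 0.0
    for z, t in zip(logits, targets):
        # Numerically stable form: max(z, 0) - z*t + log(1 + exp(-|z|))
        total += max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
    return total / len(logits)


def mixed_loss(stage_logits, targets, loss_fn=bce):
    """Sum loss_fn over every non-empty subset of stage predictions.

    stage_logits: one flat list of logits per decoder stage (assumed layout).
    Each subset is mixed by element-wise addition before the loss is taken.
    """
    total = 0.0
    for k in range(1, len(stage_logits) + 1):
        for subset in combinations(stage_logits, k):
            mixed = [sum(values) for values in zip(*subset)]
            total += loss_fn(mixed, targets)
    return total


# Three hypothetical stages, two pixels each.
stages = [[0.5, -1.0], [1.5, -0.5], [2.0, -2.0]]
labels = [1.0, 0.0]
loss = mixed_loss(stages, labels)  # sums 2**3 - 1 = 7 loss terms
```

Because every stage appears in several subsets, each stage receives gradient both on its own and in combination with the others, which is one way to interpret the "implicit ensembling" the abstract mentions.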
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Cardiac Segmentation | ACDC (test) | Avg Dice | 92.32 | 141 |
| Medical Image Segmentation | ISIC 2018 | Dice Score | 89.23 | 139 |
| Multi-organ Segmentation | Synapse multi-organ CT (test) | DSC | 84.9 | 95 |
| Medical Image Segmentation | ISIC 2017 | Dice Score | 85.67 | 74 |
| Cardiac Segmentation | ACDC | RV Score | 90.23 | 68 |
| Medical Image Segmentation | ACDC | DSC (Avg) | 91.85 | 65 |
| Medical Image Segmentation | GLAS | Dice | 96.91 | 60 |
| Abdominal multi-organ segmentation | BTCV | Spleen | 90.7 | 58 |
| Retinal Vessel Segmentation | DRIVE (test) | Accuracy | 96.89 | 52 |
| Multi-organ Segmentation | Synapse multi-organ segmentation (test) | Avg DSC | 0.849 | 50 |
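Most rows above report a Dice similarity coefficient (DSC), defined for two binary masks A and B as 2|A ∩ B| / (|A| + |B|). Note that the table mixes scales: most entries are percentages, while the last row reports a fraction (0.849). As a minimal sketch of how such a score is computed for one binary mask, assuming flat lists of 0/1 values (the function name `dice` and the convention of returning 1.0 for two empty masks are illustrative choices, not taken from the benchmarks):

```python
def dice(pred, target):
    """Dice similarity coefficient for two binary masks given as flat 0/1 lists."""
    intersection = sum(p * t for p, t in zip(pred, target))
    denominator = sum(pred) + sum(target)
    if denominator == 0:
        # Both masks are empty; treated here as perfect agreement (a convention).
        return 1.0
    return 2.0 * intersection / denominator


pred_mask = [1, 1, 0, 0]
true_mask = [1, 0, 0, 0]
score = dice(pred_mask, true_mask)  # 2 * 1 / (2 + 1)
percent = 100.0 * score             # same value on the percentage scale
```

Benchmark averages such as "Avg Dice" are typically the mean of per-class scores like this one, though the exact averaging protocol varies by dataset.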