Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning
About
In medical multi-modal frameworks, aligning cross-modality features presents a significant challenge. Existing works learn features that are implicitly aligned from the data, without considering the explicit alignment relationships present in the medical context. This reliance on data may lead to poor generalization of the learned alignment relationships. In this work, we propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework, which harnesses eye-gaze data to better align medical visual and textual features. We explore the natural auxiliary role of radiologists' eye-gaze data in aligning medical images and text, introducing a novel approach that uses eye-gaze data collected synchronously during radiologists' diagnostic evaluations. On downstream image classification and image-text retrieval tasks across four medical datasets, EGMA achieves state-of-the-art performance and stronger generalization across datasets. We also examine how varying amounts of eye-gaze data affect model performance, highlighting the feasibility and utility of integrating this auxiliary data into multi-modal alignment frameworks.
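The paper's exact objective is not reproduced here; as a rough illustration of the idea, the sketch below uses a radiologist's fixation density to weight image patch embeddings before a standard symmetric InfoNCE image-report contrastive loss. All names, the gaze-weighted pooling, and the specific loss choice are assumptions for illustration, not EGMA's actual formulation.

```python
import numpy as np

def _log_softmax(x):
    """Row-wise numerically stable log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def gaze_weighted_alignment_loss(img_patches, txt_feats, gaze_weights,
                                 temperature=0.07):
    """Hypothetical eye-gaze guided image-text alignment loss.

    img_patches:  (B, P, D) patch embeddings per image
    txt_feats:    (B, D)    pooled report embeddings
    gaze_weights: (B, P)    fixation density over the P patches
    """
    # Pool patches, weighting each by how heavily it was fixated on
    gw = gaze_weights / gaze_weights.sum(axis=1, keepdims=True)
    img_feat = (gw[:, :, None] * img_patches).sum(axis=1)          # (B, D)

    # L2-normalize both modalities so the dot product is a cosine similarity
    img_feat = img_feat / np.linalg.norm(img_feat, axis=1, keepdims=True)
    txt_feat = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)

    # Symmetric InfoNCE over the batch: matched image-report
    # pairs sit on the diagonal of the similarity matrix
    logits = img_feat @ txt_feat.T / temperature                   # (B, B)
    diag = np.arange(logits.shape[0])
    loss_i2t = -_log_softmax(logits)[diag, diag].mean()
    loss_t2i = -_log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

The gaze-weighted pooling is the only place eye-gaze enters: it biases the image representation toward regions the radiologist actually examined, while the contrastive term is unchanged from common CLIP-style training.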
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Radiology Report Generation | MIMIC-CXR (test) | BLEU-4 | 0.132 | 172 |
| Image Classification | RSNA (test) | AUC | 90.1 | 59 |
| Chest X-ray Classification | NIH (test) | AUROC | 81.8 | 47 |
| Image Classification | SIIM-ACR (test) | AUROC | 93.29 | 45 |
| Classification | RSNA (test) | F1 Score | 84.4 | 44 |
| Image Classification | SIIM (test) | F1 Score | 97.1 | 30 |
| Image Classification | CheXpert (test) | AUC | 89.5 | 25 |
| Image Classification | CheXpert 5X200 | Accuracy | 61.3 | 22 |
| Image Classification | SIIM-ACR | Accuracy | 63.62 | 20 |
| Lesion Segmentation | RSNA 56 | Dice Score | 79.69 | 12 |