A World Model of Radiologist Reading for Medical Image Representation Learning

About

Radiologist eye-tracking data provide a rich record of how experts search, compare, and accumulate evidence during image reading, yet existing methods exploit this signal only partially, either as a static spatial prior or as an auxiliary prediction target decoupled from diagnosis. We propose GazeWorld, a medical imaging world model that treats the image as the world and the radiologist's fixation sequence as a trajectory through it. GazeWorld autoregressively predicts the latent representation of the next fixated patch from all previously visited ones, while a spatial-completion branch covers unvisited regions. At inference, GazeWorld generates a sequence of patch representations from the image alone, without requiring real gaze data. Frozen GazeWorld features achieve state-of-the-art diagnostic accuracy across all nine supervised settings on CheXpert, RSNA Pneumonia, and SIIM-ACR Pneumothorax, as well as the highest zero-shot accuracy on all three benchmarks. On the GazeSearch benchmark, a generic decoder trained on the same frozen features outperforms the purpose-built LogitGaze-Med by over 16% in ScanMatch and 22% in SED, despite not being explicitly trained to predict gaze. GazeWorld demonstrates that modeling how experts read, not just what they conclude, offers a promising pretraining paradigm for medical imaging AI.
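The autoregressive objective described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 4x4 patch grid, the 8-dimensional latents, the mean-pool-plus-linear predictor, and the mean-squared-error loss are all assumptions made for illustration, and the spatial-completion branch is omitted entirely. The sketch only shows how a fixation sequence turns into (visited patches, next patch) training pairs.

```python
import numpy as np

def next_fixation_pairs(patch_latents, scanpath):
    """Build (context, target) pairs from a fixation sequence.

    context: latents of all patches visited so far, shape (t, d)
    target:  latent of the next fixated patch, shape (d,)
    """
    pairs = []
    for t in range(1, len(scanpath)):
        context = patch_latents[scanpath[:t]]
        target = patch_latents[scanpath[t]]
        pairs.append((context, target))
    return pairs

def toy_predictor(context, W):
    # Hypothetical stand-in for a causal sequence model:
    # mean-pool the visited latents, then apply a linear map.
    return context.mean(axis=0) @ W

def next_latent_loss(patch_latents, scanpath, W):
    # Assumed loss: mean squared error between predicted and true next latent,
    # averaged over all steps of the scanpath.
    pairs = next_fixation_pairs(patch_latents, scanpath)
    errors = [np.mean((toy_predictor(c, W) - t) ** 2) for c, t in pairs]
    return float(np.mean(errors))

rng = np.random.default_rng(0)
patch_latents = rng.normal(size=(16, 8))  # assumed: 4x4 patch grid, 8-dim latents
scanpath = [5, 6, 10, 9, 2]               # hypothetical fixation order (patch indices)
W = np.eye(8)                             # untrained placeholder weights

pairs = next_fixation_pairs(patch_latents, scanpath)
loss = next_latent_loss(patch_latents, scanpath, W)
print(len(pairs), loss)
```

A scanpath of five fixations yields four prediction steps, with the context growing by one patch at each step. In a real model the predictor would be a learned sequence network rather than this placeholder.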

Yiwei Li, Zihao Wu, Huaqin Zhao, Yifan Zhou, Chao Cao, Dajiang Zhu, Tianming Liu, Lin Zhao • 2026

Related benchmarks

Task	Dataset	Metric	Result
Image Classification	CheXpert 5X200	Accuracy	59.42	28
Image Classification	SIIM-ACR	Accuracy	66.59	25
Diagnostic Classification	RSNA Pneumonia	AUROC (1% labels)	83.27	9
Scanpath Prediction	GazeSearch	SM	48.9	9
Diagnostic Classification	CheXpert	AUROC (1% labels)	78.37	8
Diagnostic Classification	SIIM-ACR Pneumothorax	AUROC (1% labels)	87.85	8
Classification	RSNA	AUROC	62.84	6
