GREEN: Generative Radiology Report Evaluation and Error Notation

About

Evaluating radiology reports is challenging because factual correctness is critical for accurate communication about medical images. Existing automatic evaluation metrics either fail to consider factual correctness (e.g., BLEU and ROUGE) or are limited in their interpretability (e.g., F1CheXpert and F1RadGraph). In this paper, we introduce GREEN (Generative Radiology Report Evaluation and Error Notation), a radiology report generation metric that leverages the natural language understanding of language models to identify and explain clinically significant errors in candidate reports, both quantitatively and qualitatively. Compared to current metrics, GREEN offers: 1) a score aligned with expert preferences, 2) human-interpretable explanations of clinically significant errors, enabling feedback loops with end-users, and 3) a lightweight open-source method that reaches the performance of commercial counterparts. We validate our GREEN metric by comparing it to GPT-4, as well as to error counts of 6 experts and preferences of 2 experts. Compared to previous approaches, our method demonstrates both higher correlation with expert error counts and higher alignment with expert preferences.

Sophie Ostmeier, Justin Xu, Zhihong Chen, Maya Varma, Louis Blankemeier, Christian Bluethgen, Arne Edward Michalson, Michael Moseley, Curtis Langlotz, Akshay S Chaudhari, Jean-Benoit Delbrouck • 2024

Related benchmarks

Task	Dataset	Result
Radiology Report Meta-Evaluation	ReEvalMed	D Score: 53.5	43
Evaluation of metrics' alignment with human judgment	RadEvalX	Spearman Correlation: 0.4022	17
Radiology report discrepancy evaluation	RadEvalX 100-pair	Spearman Correlation (Total): 0.57	15
Correlation with radiologist-derived clinically significant error counts	ReXVal BLEU-optimized candidate reports (n = 50)	Kendall's τ: 0.8	12
Correlation with radiologist-derived clinically significant error counts	ReXVal BERTScore-optimized candidate reports (n = 50)	Kendall's τ: 0.75	12
Correlation with radiologist-derived clinically significant error counts	ReXVal CheXbert-optimized candidate reports (n = 50)	Kendall's τ: 0.71	12
Correlation with radiologist-derived clinically significant error counts	ReXVal RadGraph-optimized candidate reports (n = 50)	Kendall's τ: 0.71	12
Clinical error evaluation	ReEvalMed (test)	D Error Score: 53.5	11
Metric Correlation with Human Judgment	Merlin	Pearson Correlation: 0.309	7
Metric Correlation with Human Judgment	CT-RATE	Pearson Correlation: 0.111	7

Showing 10 of 13 rows
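
Several rows above report Kendall's τ between a metric and radiologist-derived error counts. As a rough illustration of what that statistic measures, here is a minimal pure-Python sketch of the tau-b variant. It is not the evaluation code used for these benchmarks: the function name `kendall_tau_b` and the two input lists are hypothetical, and the numbers are toy values chosen only to make the arithmetic easy to follow.

```python
from itertools import combinations
from math import sqrt


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied in both sequences: contributes to neither factor
        if dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1  # the pair is ordered the same way in x and y
        else:
            discordant += 1  # the pair is ordered in opposite ways

    untied = concordant + discordant
    # Pairs not tied in x times pairs not tied in y, under a square root.
    denom = sqrt((untied + tied_y_only) * (untied + tied_x_only))
    return (concordant - discordant) / denom if denom else 0.0


# Hypothetical toy data: error counts per report from an expert and a metric.
expert_errors = [0, 1, 2, 3, 4]
metric_errors = [0, 2, 1, 3, 4]

print(round(kendall_tau_b(expert_errors, metric_errors), 2))  # prints 0.8
```

In the toy data, 9 of the 10 report pairs are ranked the same way by both sources and 1 pair is swapped, giving (9 - 1) / 10 = 0.8. In practice a library routine such as `scipy.stats.kendalltau` would normally be used instead of a hand-written loop.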
