
Cite-While-You-Generate: Training-Free Evidence Attribution for Multimodal Clinical Summarization

About

Trustworthy clinical summarization requires not only fluent generation but also transparency about where each statement comes from. We propose a training-free framework for generation-time source attribution that leverages decoder attentions to directly cite supporting text spans or images, overcoming the limitations of post-hoc or retraining-based methods. We introduce two strategies for multimodal attribution: a raw-image mode, which directly uses image patch attentions, and a caption-as-span mode, which substitutes images with generated captions to enable purely text-based alignment. Evaluations on two representative domains, clinician-patient dialogues (CliConSummation) and radiology reports (MIMIC-CXR), show that our approach consistently outperforms embedding-based and self-attribution baselines, improving both text-level and multimodal attribution accuracy (e.g., +15% F1 over embedding baselines). Caption-based attribution achieves performance competitive with raw-image attention while being more lightweight and practical. These findings highlight attention-guided attribution as a promising step toward interpretable and deployable clinical summarization systems.
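As a rough illustration of the core idea, the sketch below shows how decoder cross-attention weights could be pooled to cite the most-attended source span for each generated summary sentence. The function name, tensor layout, and span bookkeeping are assumptions for illustration, not the paper's implementation; in caption-as-span mode, the source spans would simply include the generated image captions alongside the text spans.

```python
# Minimal sketch of attention-guided span attribution, assuming a
# decoder that exposes cross-attention weights (e.g., a HuggingFace-style
# model called with output_attentions=True). All names and shapes here
# are illustrative assumptions, not the authors' code.
import torch

def attribute_spans(cross_attn, span_bounds, gen_bounds):
    """Map each generated sentence to its most-attended source span.

    cross_attn:  [layers, heads, gen_len, src_len] attention weights
                 collected while decoding the summary.
    span_bounds: list of (start, end) token indices delimiting source
                 spans (dialogue turns, report sentences, or image
                 captions in caption-as-span mode).
    gen_bounds:  list of (start, end) token indices delimiting the
                 generated summary sentences to be cited.
    """
    # Average over layers and heads -> [gen_len, src_len]
    attn = cross_attn.mean(dim=(0, 1))
    citations = []
    for g_start, g_end in gen_bounds:
        # Pool the attention mass this sentence's tokens place on the source
        sent_attn = attn[g_start:g_end].mean(dim=0)          # [src_len]
        # Sum attention mass inside each candidate span
        span_scores = torch.stack(
            [sent_attn[s:e].sum() for s, e in span_bounds])  # [n_spans]
        # Cite the span that received the most attention
        citations.append(int(span_scores.argmax()))
    return citations
```

In raw-image mode, the same pooling would run over image patch positions instead of caption tokens; the trade-off the abstract describes is that caption-as-span keeps everything in text space, at a small cost in attribution fidelity.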

Qianqi Yan, Huy Nguyen, Sumana Srivatsa, Hari Bandi, Xin Eric Wang, Krishnaram Kenthapadi • 2026

Related benchmarks

Task         Dataset                                    Metric    Result  Rank
Attribution  CLICONSUMMATION multimodal (163 samples)   Text F1   65.52   8
Attribution  CLICONSUMMATION text-only (256 samples)    Macro-F1  76.33   6
