
Real-Time Visual Attribution Streaming in Thinking Model

About

We present an amortized framework for real-time visual attribution streaming in multimodal thinking models. When these models generate code from a screenshot or solve math problems from images, their long reasoning traces should be grounded in visual evidence. However, verifying this reliance is challenging: faithful causal methods require costly repeated backward passes or perturbations, while raw attention maps offer instant access but lack causal validity. To resolve this, we introduce an amortized approach that learns to estimate the causal effects of semantic regions directly from the rich signals encoded in attention features. Across five diverse benchmarks and four thinking models, our approach achieves faithfulness comparable to exhaustive causal methods while enabling visual attribution streaming, where users observe grounding evidence as the model reasons rather than afterward. Our results demonstrate that real-time, faithful attribution in multimodal thinking models is achievable through lightweight learning, not brute-force computation.
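
The amortization idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the scoring function, the two-dimensional "attention features", the linear estimator, and all names (`toy_score`, `ablation_effects`, `fit_linear`, `make_image`) are assumptions made purely for illustration. The sketch computes expensive leave-one-out effects once on training data, fits a lightweight regressor from attention features to those effects, and then predicts effects for new regions without any ablation.

```python
import random

def toy_score(regions):
    # Hypothetical stand-in for a model's output score given the visible regions.
    return sum(r["weight"] for r in regions)

def ablation_effects(regions):
    # Expensive reference: leave-one-out effect of each region (one re-score per region).
    full = toy_score(regions)
    return [full - toy_score(regions[:i] + regions[i + 1:]) for i in range(len(regions))]

def make_image(rng, n_regions=5):
    # Assumption: each region has two attention statistics that determine its true effect.
    regions = []
    for _ in range(n_regions):
        a_mean, a_max = rng.random(), rng.random()
        regions.append({"attn": [a_mean, a_max],
                        "weight": 0.7 * a_mean + 0.2 * a_max + 0.05})
    return regions

def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def fit_linear(features, targets, lr=0.5, epochs=2000):
    # Lightweight estimator: linear regression fitted by gradient descent on squared error.
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            err = predict(w, b, x) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

rng = random.Random(0)

# Offline: pay the ablation cost once to build training targets.
train_images = [make_image(rng) for _ in range(8)]
X = [r["attn"] for img in train_images for r in img]
y = [e for img in train_images for e in ablation_effects(img)]
w, b = fit_linear(X, y)

# Online: estimate effects from attention features alone, with no re-scoring.
new_image = make_image(rng)
fast = [predict(w, b, r["attn"]) for r in new_image]
slow = ablation_effects(new_image)
```

In this toy setting the true effect is exactly linear in the features, so the fast estimates match the ablation reference closely; a real model would need a richer estimator and a faithfulness check against a causal baseline.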

Seil Kang, Woojung Han, Junhyeok Kim, Jinyeong Kim, Youngeun Kim, Seong Jae Hwang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Referring Segmentation | RefCOCO (val) | -- | 84
Image Segmentation | COCO | mIoU 24 | 39
Visual Attribution | Thinking-Model Attribution Dataset Document | LDS 0.74 | 24
Visual Attribution | Thinking-Model Attribution Dataset General | LDS 0.73 | 24
Visual Attribution | Thinking-Model Attribution Dataset Overall | Avg. LDS 72 | 24
Visual Attribution | Thinking-Model Attribution Dataset Science | LDS 74 | 24
Visual Attribution | Thinking-Model Attribution Dataset Math | LDS 75 | 24
Visual Attribution | Thinking-Model Attribution Dataset Code | LDS 71 | 24
Segmentation | ImageNet-Seg (test) | mIoU 27 | 5
