
Attributions All the Way Down? The Metagame of Interpretability

About

We introduce the metagame, a conceptual framework for quantifying second-order interaction effects of model explanations. For any first-order attribution $\phi(f)$ explaining a model $f$, we measure the directional influence of feature $j$ on the attribution of feature $i$, denoted as meta-attribution $\varphi_{j \to i}(f)$, by treating the attribution method itself as a cooperative game and computing its Shapley value. Theoretically, we prove that attributions hierarchically decompose into meta-attributions, and establish these as directional extensions of existing interaction indices. Empirically, we demonstrate that the metagame delivers insights across diverse interpretability applications: (i) quantifying token interactions in instruction-tuned language models, (ii) explaining cross-modal similarity in vision-language encoders, and (iii) interpreting text-to-image concepts in multimodal diffusion transformers.

Hubert Baniecki, Przemyslaw Biecek, Fabian Fumagalli • 2026
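
To make the idea in the abstract concrete, here is a minimal sketch of one plausible reading of the construction. It is not necessarily the authors' exact definition. It assumes the first-order attribution $\phi_i$ is the exact Shapley value of a set function, and that the "metagame" for feature $i$ assigns to each coalition $S$ the attribution $i$ receives when only the features in $S$ are available (zero if $i \notin S$). The names `shapley`, `meta_attribution`, and `toy_value` are hypothetical and chosen for illustration only.

```python
from itertools import combinations
from math import factorial


def shapley(value, players):
    """Exact Shapley values of `value` (a function of a frozenset) over `players`."""
    players = list(players)
    n = len(players)
    phi = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Shapley weight for coalitions of size k that exclude p
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def meta_attribution(value, players, i):
    """Influence of every feature j on the attribution of feature i.

    Assumption (not taken from the paper): the metagame for feature i maps a
    coalition S to the Shapley value of i in the game restricted to S.
    """
    def metagame(coalition):
        if i not in coalition:
            return 0.0  # an absent feature receives no attribution
        return shapley(value, coalition)[i]

    # Second-order step: Shapley values of the metagame itself
    return shapley(metagame, players)


# Hypothetical toy model: feature 0 acts alone, features 1 and 2 interact.
players = [0, 1, 2]


def toy_value(coalition):
    out = 0.0
    if 0 in coalition:
        out += 1.0
    if 1 in coalition and 2 in coalition:
        out += 2.0
    return out


phi = shapley(toy_value, players)
meta_on_1 = meta_attribution(toy_value, players, i=1)
```

Under these assumptions, the values in `meta_on_1` sum to `phi[1]` by the efficiency property of the Shapley value, which mirrors the hierarchical decomposition the abstract describes. In this toy game, feature 0 has no influence on the attribution of feature 1, while features 1 and 2 share it equally. Exact enumeration is exponential in the number of features, so this sketch only suits very small games.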

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Interaction Recognition | ImageNet-1k 2 objects | Interaction Recognition | 88 | 26 |
| Interaction Recognition | ImageNet 1k (3 objects) | Interaction Recognition Accuracy | 90 | 26 |
| Interaction Recognition | ImageNet-1k 4 objects | Interaction Recognition Accuracy | 91 | 26 |
| Interaction Recognition | ImageNet-1k 1 object | Interaction Recognition | 89 | 26 |
| Image Segmentation | Pascal VOC | Accuracy | 90.3 | 4 |
| Image Segmentation | MS-COCO | Accuracy | 88.9 | 4 |
