
Learning to Attribute with Attention

About

Given a sequence of tokens generated by a language model, we may want to identify the preceding tokens that influence the model to generate this sequence. Performing such token attribution is expensive; a common approach is to ablate preceding tokens and directly measure their effects. To reduce the cost of token attribution, we revisit attention weights as a heuristic for how a language model uses previous tokens. Naive approaches to attribute model behavior with attention (e.g., averaging attention weights across attention heads to estimate a token's influence) have been found to be unreliable. To attain faithful attributions, we propose treating the attention weights of different attention heads as features. This way, we can learn how to effectively leverage attention weights for attribution (using signal from ablations). Our resulting method, Attribution with Attention (AT2), reliably performs on par with approaches that involve many ablations, while being significantly more efficient. To showcase the utility of AT2, we use it to prune less important parts of a provided context in a question answering setting, improving answer quality. We provide code for AT2 at https://github.com/MadryLab/AT2.

Benjamin Cohen-Wang, Yung-Sung Chuang, Aleksander Madry • 2025
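
The core idea in the abstract, treating each attention head's weights as a feature and learning how to combine them using signal from ablations, can be sketched roughly as follows. This is a minimal illustration on synthetic data, not the authors' implementation (the linked repository has that). The function names, the additive model of ablation effects, and the ridge-regularized least-squares fit are all assumptions made here for illustration; the paper's actual training objective may differ.

```python
import numpy as np


def fit_head_coefficients(attn, masks, effects, ridge=1e-3):
    """Learn one coefficient per attention head from ablation measurements.

    attn:    (S, H) attention mass from the generated tokens to each of S
             sources, for each of H heads (hypothetical pre-aggregated input).
    masks:   (N, S) ablation masks, 1 = source kept, 0 = source ablated.
    effects: (N,) measured effect of each ablation, e.g. the drop in
             log-probability of the generated sequence.
    """
    # Assumed additive surrogate: the effect of an ablation is the sum of the
    # scores of the ablated sources, where score_s = attn[s] @ theta.
    design = (1.0 - masks) @ attn  # (N, H)
    num_heads = attn.shape[1]
    gram = design.T @ design + ridge * np.eye(num_heads)
    return np.linalg.solve(gram, design.T @ effects)


def attribute(attn, theta):
    """Score each source as a learned weighted sum of per-head attention."""
    return attn @ theta


def naive_average(attn):
    """Baseline: average attention over heads, ignoring what each head does."""
    return attn.mean(axis=1)


def demo(seed=0, num_sources=20, num_heads=8, num_ablations=200):
    """Synthetic check: only two heads carry signal, the rest are distractors."""
    rng = np.random.default_rng(seed)
    attn = rng.random((num_sources, num_heads))
    true_theta = np.zeros(num_heads)
    true_theta[2], true_theta[5] = 1.5, -0.5
    true_scores = attn @ true_theta

    # Simulated ablation experiments standing in for real model evaluations.
    masks = rng.integers(0, 2, size=(num_ablations, num_sources)).astype(float)
    effects = (1.0 - masks) @ true_scores
    effects = effects + 0.01 * rng.standard_normal(num_ablations)

    theta = fit_head_coefficients(attn, masks, effects)
    learned_corr = np.corrcoef(attribute(attn, theta), true_scores)[0, 1]
    naive_corr = np.corrcoef(naive_average(attn), true_scores)[0, 1]
    return theta, learned_corr, naive_corr


if __name__ == "__main__":
    theta, learned_corr, naive_corr = demo()
    print("learned head coefficients:", np.round(theta, 2))
    print("correlation with true scores (learned):", round(learned_corr, 3))
    print("correlation with true scores (naive average):", round(naive_corr, 3))
```

In this sketch the ablations are needed only to fit the head coefficients. Once they are fitted, scoring the sources for a new generation is a single weighted sum over attention weights, which is where the efficiency gain described in the abstract would come from.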

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Citation Attributability | Transfer | QA Score | 70.1 | 54 |
| Citation Control | CITECONTROL | Re Score | 100 | 54 |
| Knowledge corruption traceback | NQ | Precision | 78 | 30 |
| Knowledge corruption traceback | MS Marco | Precision | 75 | 26 |
| Traceback (Prompt Injection Attacks) | MuSiQue | Precision (MuSiQue Traceback) | 87 | 23 |
| Knowledge corruption traceback | HotpotQA | Precision | 0.66 | 16 |
| Traceback (Prompt Injection Attacks) | QMSum | Precision | 0.84 | 13 |
| Traceback (Prompt Injection Attacks) | NarrativeQA | Precision | 64 | 13 |
| Context Traceback | NarrativeQA LongBench | Precision | 86 | 10 |
| Context Traceback | QMSum LongBench | Precision | 95 | 10 |
Showing 10 of 11 rows
