
ReAGent: A Model-agnostic Feature Attribution Method for Generative Language Models

About

Feature attribution methods (FAs), such as gradients and attention, are widely employed approaches to derive the importance of all input features to the model predictions. Existing work in natural language processing has mostly focused on developing and testing FAs for encoder-only language models (LMs) in classification tasks. However, it is unknown whether it is faithful to use these FAs for decoder-only models on text generation, due to the inherent differences between the model architectures and task settings. Moreover, previous work has demonstrated that there is no 'one-wins-all' FA across models and tasks. This makes the selection of an FA computationally expensive for large LMs, since deriving input importance often requires multiple forward and backward passes, including gradient computations that might be prohibitive even with access to large compute. To address these issues, we present a model-agnostic FA for generative LMs called Recursive Attribution Generator (ReAGent). Our method updates the token importance distribution in a recursive manner. For each update, we compute the difference in the probability distribution over the vocabulary for predicting the next token between using the original input and using a modified version in which part of the input is replaced with RoBERTa predictions. Our intuition is that replacing an important token in the context should result in a larger change in the model's confidence in predicting the next token than replacing an unimportant token. Our method can be universally applied to any generative LM without accessing internal model weights or requiring additional training or fine-tuning, as most other FAs do. We extensively compare the faithfulness of ReAGent with seven popular FAs across six decoder-only LMs of various sizes. The results show that our method consistently provides more faithful token importance distributions.
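The recursive update described in the abstract can be sketched in a few lines. The following is a minimal, simplified illustration, not the authors' implementation: `predict_next_dist` is a hypothetical stand-in for the target LM's next-token distribution, `replace_tokens` stands in for the RoBERTa-based replacement step, and the masking schedule and scoring rule are assumptions chosen for clarity.

```python
import numpy as np

def reagent_importance(tokens, predict_next_dist, replace_tokens,
                       n_iters=50, mask_ratio=0.3, rng=None):
    """Simplified ReAGent-style loop (a sketch, not the authors' code).

    tokens:            list of context tokens
    predict_next_dist: callable(tokens) -> next-token probability
                       distribution over the vocabulary (the target LM)
    replace_tokens:    callable(tokens, positions) -> tokens with the given
                       positions replaced (stands in for RoBERTa infilling)
    """
    rng = rng or np.random.default_rng(0)
    n = len(tokens)
    importance = np.full(n, 1.0 / n)          # start from a uniform distribution
    p_orig = predict_next_dist(tokens)
    for _ in range(n_iters):
        # sample a subset of positions and replace them
        k = max(1, int(mask_ratio * n))
        pos = rng.choice(n, size=k, replace=False)
        p_mod = predict_next_dist(replace_tokens(tokens, pos))
        # a larger shift in the next-token distribution means the replaced
        # positions mattered more (total-variation-style gap, an assumption)
        delta = float(np.abs(p_orig - p_mod).sum())
        importance[pos] += delta / k
        importance /= importance.sum()        # renormalise to a distribution
    return importance

# Toy usage: a 2-word vocabulary where the token "key" strongly determines
# the next-token distribution, so it should receive the highest importance.
def toy_predict(tokens):
    return np.array([0.9, 0.1]) if "key" in tokens else np.array([0.5, 0.5])

def toy_replace(tokens, positions):
    out = list(tokens)
    for i in positions:
        out[i] = "UNK"
    return out

scores = reagent_importance(["the", "key", "fact"], toy_predict, toy_replace)
```

In the toy run, replacing position 1 ("key") is the only change that moves the next-token distribution, so its importance score dominates, which is exactly the intuition the abstract describes.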

Zhixue Zhao, Boxuan Shan • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Faithfulness Evaluation | TellMeWhy | AUC π-Soft-NS | 0.39 | 67 |
| Faithfulness Evaluation | WikiBio | AUC π-Soft-NS | 0.4 | 67 |
| Attribution Alignment | Curated Attribution Dataset (NarrativeQA + SciQ) | DSA (Dependent Sentence Attribution) | 3.78 | 40 |
| Attribution Faithfulness | LongRA | Soft-NC Score | 1.68 | 40 |
| Faithfulness Evaluation | Halogen | CODE | 42 | 20 |
| Faithfulness Measurement | MHC | BLEU | 68.4 | 18 |
| Fact Checking | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Fact EM Δ | 2 | 14 |
| Tool Use | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Tool Hit@1 Δ | 2.1 | 14 |
| Causal Attribution | Causal and Downstream Robustness Ablation Suite (averaged over LLaMA-3.1 70B, Phi-3 14B, GPT-J 6B, Qwen2.5 3B) | Causal Pass@5 | 68 | 14 |
| Decoding Stability | Causal and Downstream Robustness Ablation Suite (averaged over 4 models) | Decoding Δ% | 2.4 | 14 |
Showing 10 of 17 rows
