
Grad-ELLM: Gradient-based Explanations for Decoder-only LLMs

About

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their black-box nature raises concerns about transparency and faithfulness. Input attribution methods aim to highlight each input token's contribution to the model's output, but existing approaches are typically model-agnostic and do not exploit transformer-specific architecture, leading to limited faithfulness. To address this, we propose Grad-ELLM, a gradient-based attribution method for decoder-only transformer-based LLMs. By aggregating channel importance from gradients of the output logit with respect to attention layers and spatial importance from attention maps, Grad-ELLM generates heatmaps at each generation step without requiring architectural modifications. Additionally, we introduce two faithfulness metrics, $\pi$-Soft-NC and $\pi$-Soft-NS, which modify Soft-NC/NS to provide fairer comparisons by controlling the amount of information kept when perturbing the text. We evaluate Grad-ELLM on sentiment classification, question answering, and open-generation tasks using different models. Experimental results show that Grad-ELLM consistently achieves higher faithfulness than other attribution methods.
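The core idea, combining gradient-derived channel importance with attention-derived spatial importance, can be illustrated with a minimal PyTorch sketch. This is an assumption-laden toy (a single self-attention head standing in for one LLM layer, with a Grad-CAM-style aggregation rule), not the exact Grad-ELLM formulation; all names and sizes below are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy single-head self-attention block standing in for one decoder layer.
# d = channel width, T = number of input tokens (illustrative sizes).
d, T = 8, 5
x = torch.randn(T, d)
Wq, Wk, Wv = (nn.Linear(d, d, bias=False) for _ in range(3))
head = nn.Linear(d, 1, bias=False)  # stand-in for the output logit head

q, k, v = Wq(x), Wk(x), Wv(x)
attn = torch.softmax(q @ k.t() / d ** 0.5, dim=-1)  # (T, T) attention map
ctx = attn @ v                      # attention-layer activations, (T, d)
ctx.retain_grad()                   # keep gradients on this non-leaf tensor
logit = head(ctx[-1])               # logit for the last (generated) token
logit.backward()

# Channel importance: gradient of the logit w.r.t. the layer activations,
# averaged over token positions (Grad-CAM-style channel weighting).
alpha = ctx.grad.mean(dim=0)                 # (d,)
cam = torch.relu(ctx @ alpha)                # (T,) channel-weighted score
# Spatial importance: attention paid by the last query token to each input.
heat = cam * attn[-1].detach()               # combine the two signals
heat = heat / heat.sum().clamp_min(1e-8)     # normalized per-token heatmap
print(heat)
```

In a real decoder-only LLM this would run per generation step, with hooks capturing attention maps and gradients across all layers and heads before aggregation; the toy keeps one layer so the weighting logic stays visible.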

Xin Huang, Antoni B. Chan • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Faithfulness Evaluation | IMDB | AUC π-Soft-NS | 57.2 | 27 |
| Faithfulness Evaluation | SST2 | AUC π-Soft-NS | 0.563 | 27 |
| Faithfulness Evaluation | BoolQ | AUC π-Soft-NS | 37 | 27 |
| Faithfulness Evaluation | TellMeWhy | AUC π-Soft-NS | 0.368 | 27 |
| Faithfulness Evaluation | WikiBio | AUC π-Soft-NS | 0.438 | 27 |
| Sentiment Classification | SST2 | Deletion Robustness | 0.3279 | 20 |
| Sentiment Classification | IMDB | Deletion Rate | 22.37 | 20 |
