
Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

About

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $\pi$-Soft-NC and $\pi$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $\pi$-Soft-NC, while there is no dominant method under $\pi$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.

Xin Huang, Antoni B. Chan • 2026
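
The core idea in the abstract, comparing attribution methods under the same expected retaining probability, can be illustrated with a minimal sketch. This is not the paper's implementation: the calibration rule (a scale factor found by bisection, with clipping to [0, 1]), the token-level Bernoulli mask, and the names `calibrate_retain_probs` and `sample_keep_mask` are all assumptions made for illustration. It only shows how two methods with very different score magnitudes can be forced to retain the same expected fraction of tokens.

```python
import random


def calibrate_retain_probs(scores, pi, iters=60):
    """Rescale attribution scores into retain probabilities with mean ~= pi.

    Hypothetical calibration (an assumption, not the paper's rule):
    p_i = clip(c * s_i, 0, 1), with c chosen by bisection.
    """
    if not 0.0 < pi < 1.0:
        raise ValueError("pi must lie strictly between 0 and 1")
    n = len(scores)
    positive_fraction = sum(1 for s in scores if s > 0) / n if n else 0.0
    if pi > positive_fraction:
        # Tokens with non-positive scores can never be retained under this
        # rule, so the target mean would be unreachable.
        raise ValueError("pi exceeds the fraction of positively scored tokens")

    def probs(c):
        return [min(1.0, max(0.0, c * s)) for s in scores]

    def mean(values):
        return sum(values) / len(values)

    # The mean is non-decreasing in c, so first bracket the target ...
    lo, hi = 0.0, 1.0
    while mean(probs(hi)) < pi:
        hi *= 2.0
    # ... then bisect down to the scale that matches it.
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if mean(probs(mid)) < pi:
            lo = mid
        else:
            hi = mid
    return probs(hi)


def sample_keep_mask(retain_probs, rng):
    """Draw one Bernoulli keep/drop decision per token."""
    return [rng.random() < p for p in retain_probs]


# Two hypothetical methods that rank tokens identically but differ in scale.
method_a = [0.9, 0.8, 0.7, 0.6]
method_b = [0.09, 0.08, 0.07, 0.06]

probs_a = calibrate_retain_probs(method_a, pi=0.5)
probs_b = calibrate_retain_probs(method_b, pi=0.5)
mask_a = sample_keep_mask(probs_a, random.Random(0))
```

Without calibration, `method_a` would retain roughly ten times as many tokens as `method_b` despite the identical ranking. After calibration both retain half the tokens in expectation, so any remaining difference in a faithfulness score would reflect how the methods distribute importance rather than how large their scores are.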

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Faithfulness Evaluation | TellMeWhy | AUC π-Soft-NS | 0.368 | 67 |
| Faithfulness Evaluation | WikiBio | AUC π-Soft-NS | 0.438 | 67 |
| Faithfulness Evaluation | IMDB | AUC π-Soft-NS | 57.2 | 27 |
| Faithfulness Evaluation | SST2 | AUC π-Soft (NS) | 0.563 | 27 |
| Faithfulness Evaluation | BoolQ | AUC π-Soft-NS | 37 | 27 |
| Sentiment Classification | SST2 | Deletion Robustness | 0.3279 | 20 |
| Sentiment Classification | IMDB | Deletion Rate | 22.37 | 20 |