# Aligning What LLMs Do and Say: Towards Self-Consistent Explanations

## About
Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their answers. Yet the features driving an answer often differ from those emphasized in its explanation, meaning post-hoc rationales can misrepresent what actually shaped the model's output. We quantify this gap by comparing the feature-importance distributions of answers and their explanations. Prior analyses reveal such discrepancies, but large-scale study has been limited by the high computational cost of attribution methods. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB), a large-scale benchmark linking model decisions with diverse explanations and attribution vectors across datasets, methods, and model families. Using PSCB, we find that Spearman rank correlation provides a more reliable signal of alignment than cosine similarity. Building on this insight, we apply Direct Preference Optimization (DPO) to attribution-based preference data, improving alignment without degrading task accuracy, and show that standard supervised fine-tuning on the same data fails to achieve comparable gains. These improvements generalize robustly across domains, paving the way toward scalable and faithful alignment between LLM decisions and their natural language explanations.
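To illustrate why rank correlation and cosine similarity can disagree when comparing an answer's attribution vector with its explanation's, here is a minimal plain-Python sketch (the attribution values and the helper names `spearman`/`cosine` are hypothetical, not taken from PSCB). Two vectors that put nearly identical mass on each token can score near 1.0 in cosine similarity while their rankings of the most important token differ:

```python
import math

def _ranks(xs):
    # Average ranks (ties share the mean of their positions), 1-based.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean 1-based rank of the tie block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def _pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of the rank vectors.
    return _pearson(_ranks(a), _ranks(b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical attribution vectors: the answer and its explanation put
# nearly identical mass on each token but disagree on which token is #1.
answer_attr      = [0.9, 0.8, 0.1]
explanation_attr = [0.8, 0.9, 0.1]

print(round(cosine(answer_attr, explanation_attr), 3))    # 0.993: looks aligned
print(round(spearman(answer_attr, explanation_attr), 3))  # 0.5: ranking disagrees
```

The magnitude-insensitive rank statistic flags the swapped top feature that the angle-based measure smooths over, which is consistent with the paper's finding that Spearman correlation is the more reliable alignment signal.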
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | ECQA | Accuracy | 70.34 | 12 |
| Question Answering | ARC Easy | Accuracy (ARC-E) | 87.72 | 12 |
| Explanation self-consistency | ECQA (test) | Accuracy | 71.11 | 4 |
| Explanation self-consistency | ARC-E (test) | Accuracy | 87.01 | 4 |
| Explanation self-consistency | ARC-C (test) | Accuracy | 77.57 | 4 |
| Explanation self-consistency | CODAH (test) | Accuracy | 83.39 | 3 |