# Aligning What LLMs Do and Say: Towards Self-Consistent Explanations

## About
Large language models (LLMs) seem to offer an easy path to interpretability: just ask them to explain their answers. Yet the features driving an answer often differ from those emphasized in its explanation, meaning post-hoc rationales can misrepresent what actually shaped the model's output. We quantify this gap by comparing the feature-importance distributions of answers and their explanations. Prior analyses reveal such discrepancies, but large-scale study has been limited by the high computational cost of attribution methods. To address this, we introduce the Post-hoc Self-Consistency Bank (PSCB), a large-scale benchmark linking model decisions with diverse explanations and attribution vectors across datasets, methods, and model families. Using PSCB, we find that Spearman rank correlation provides a more reliable signal of alignment than cosine similarity. Building on this insight, we apply Direct Preference Optimization (DPO) to attribution-based preference data, improving alignment without degrading task accuracy, and show that standard supervised fine-tuning on the same data fails to achieve comparable gains. These improvements generalize robustly across domains, paving the way toward scalable and faithful alignment between LLM decisions and their natural language explanations.
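To illustrate why rank correlation and cosine similarity can disagree when comparing an answer's attribution vector with its explanation's, here is a minimal plain-Python sketch (the attribution values and the helper names `spearman`/`cosine` are hypothetical, not taken from PSCB). Two vectors that put nearly identical mass on each token can score near 1.0 in cosine similarity while their rankings of the most important token differ:

```python
import math

def _ranks(xs):
    # Average ranks (ties share the mean of their positions), 1-based.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean 1-based rank of the tie block
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def _pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sa * sb)

def spearman(a, b):
    # Spearman rank correlation = Pearson correlation of the rank vectors.
    return _pearson(_ranks(a), _ranks(b))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Hypothetical attribution vectors: the answer and its explanation put
# nearly identical mass on each token but disagree on which token is #1.
answer_attr      = [0.9, 0.8, 0.1]
explanation_attr = [0.8, 0.9, 0.1]

print(round(cosine(answer_attr, explanation_attr), 3))    # 0.993: looks aligned
print(round(spearman(answer_attr, explanation_attr), 3))  # 0.5: ranking disagrees
```

The magnitude-insensitive rank statistic flags the swapped top feature that the angle-based measure smooths over, which is consistent with the paper's finding that Spearman correlation is the more reliable alignment signal.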
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Question Answering | ECQA | Accuracy | 70.34 | 12 |
| Question Answering | ARC Easy | Accuracy (ARC-E) | 87.72 | 12 |
| Explanation self-consistency | ECQA (test) | Accuracy | 71.11 | 4 |
| Explanation self-consistency | ARC-E (test) | Accuracy | 87.01 | 4 |
| Explanation self-consistency | ARC-C (test) | Accuracy | 77.57 | 4 |
| Explanation self-consistency | CODAH (test) | Accuracy | 83.39 | 3 |