LAMP: Extracting Local Decision Surfaces From Large Language Models

About

We introduce LAMP (Local Attribution Mapping Probe), a method that probes a black-box language model's decision surface and measures how reliably the model maps its stated reasons to its reported predictions. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links the weights the model assigns to its stated factors to the model's output. The fitted surrogate shows how strongly the stated factors steer the model's decisions. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, the locally linear decision landscapes recovered for many language models broadly agree with human judgments of explanation quality and, on a clinical case-file dataset, align with expert assessments. Because LAMP requires no access to model gradients, logits, or internal activations, it serves as a practical, lightweight framework for auditing proprietary language models and for assessing whether a model appears to behave consistently with the explanations it provides.
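The surrogate-fitting idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the self-reported factor weights have already been collected into a numeric matrix, uses ordinary least squares as a stand-in for whatever fitting procedure the paper uses, and runs on synthetic data. The function name `fit_local_linear_surrogate` and all variable names are hypothetical.

```python
import numpy as np


def fit_local_linear_surrogate(weights, outputs):
    """Fit outputs ~ weights @ coef + intercept by ordinary least squares.

    weights: (n_samples, n_factors) self-reported factor weights (assumed input).
    outputs: (n_samples,) model-reported predictions (assumed input).
    Returns (coef, intercept, r_squared).
    """
    X = np.asarray(weights, dtype=float)
    y = np.asarray(outputs, dtype=float)
    # Append a column of ones so the intercept is estimated jointly.
    A = np.hstack([X, np.ones((X.shape[0], 1))])
    solution, *_ = np.linalg.lstsq(A, y, rcond=None)
    coef, intercept = solution[:-1], solution[-1]
    # R^2 measures how much of the output variation the stated factors explain.
    residual = y - A @ solution
    ss_res = float(np.sum(residual ** 2))
    ss_tot = float(np.sum((y - y.mean()) ** 2))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")
    return coef, intercept, r_squared


# Synthetic stand-in data: 50 local perturbations, 3 stated factors.
rng = np.random.default_rng(0)
true_coef = np.array([0.5, -0.3, 0.2])
weights = rng.uniform(0.0, 1.0, size=(50, 3))
outputs = weights @ true_coef + 0.1  # exactly linear, for illustration only

coef, intercept, r_squared = fit_local_linear_surrogate(weights, outputs)
print(coef, intercept, r_squared)
```

In this reading, a high R² would suggest that the stated factors account for most of the local variation in the model's output, while a low value would suggest that the explanations and the predictions diverge.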

Ryan Chen, Youngmin Ko, Zeyu Zhang, Catherine Cho, Sunny Chung, Mauro Giuffré, Dennis L. Shung, Bradly C. Stadie • 2025

Related benchmarks

Task	Dataset	Result
Predicting Language Model Output	IMDB	Brier Score: 0.0089	28
Predicting Language Model Output	PH	Brier Score: 0.0073	28
Predicting Language Model Output	HateBS	Brier Score: 0.0043	28
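For readers unfamiliar with the metric in the table, the Brier score is the mean squared difference between predicted probabilities and observed binary outcomes, so lower is better and 0 is a perfect score. The sketch below shows the standard binary form. The listing does not state how the reported values were computed, and the helper name `brier_score` and the sample numbers are illustrative only.

```python
def brier_score(predicted_probs, observed_outcomes):
    """Mean squared error between probabilities in [0, 1] and 0/1 outcomes."""
    if not predicted_probs or len(predicted_probs) != len(observed_outcomes):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [
        (p - o) ** 2 for p, o in zip(predicted_probs, observed_outcomes)
    ]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only, not taken from the paper.
example_score = brier_score([0.9, 0.2, 0.7], [1, 0, 1])
print(example_score)
```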

