LAMP: Extracting Local Decision Surfaces From Large Language Models

About

We introduce LAMP (Local Attribution Mapping Probe), a method that sheds light on a black-box language model's decision surface by approximating it locally, and uses that approximation to study how reliably the model maps its stated reasons to its reported predictions. LAMP treats the model's own self-reported explanations as a coordinate system and fits a locally linear surrogate that links the weights in those explanations to the model's output. The fitted surrogate shows how strongly the stated factors steer the model's decisions. We apply LAMP to three tasks: sentiment analysis, controversial-topic detection, and safety-prompt auditing. Across these tasks, the locally linear decision landscapes that LAMP recovers for many language models broadly agree with human judgments of explanation quality and, on a clinical case-file data set, align with expert assessments. Because LAMP requires no access to model gradients, logits, or internal activations, it serves as a practical, lightweight framework for auditing proprietary language models and for assessing whether a model appears to behave consistently with the explanations it provides.

Ryan Chen, Youngmin Ko, Zeyu Zhang, Catherine Cho, Sunny Chung, Mauro Giuffré, Dennis L. Shung, Bradly C. Stadie • 2025
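
The surrogate-fitting idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of ordinary least squares, and the synthetic data are all assumptions made for illustration. In practice the `weights` and `predictions` lists would come from querying a black-box model for its self-reported factor weights and its reported prediction on perturbed inputs; here they are generated from a known linear rule so the fit can be checked.

```python
import random


def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augment the matrix so row operations apply to both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_local_surrogate(weights, predictions):
    """Least-squares fit of predictions ~ weights . coef + intercept.

    Hypothetical helper: `weights` holds one vector of self-reported
    factor weights per query, `predictions` the matching model outputs.
    """
    rows = [list(w) + [1.0] for w in weights]  # trailing 1.0 fits the intercept
    k = len(rows[0])
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, predictions)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)
    return beta[:-1], beta[-1]


# Synthetic stand-in for black-box queries (assumed, not real model output).
rng = random.Random(0)
true_coef, true_intercept = [0.3, -0.2, 0.1], 0.5
weights = [[rng.random() for _ in true_coef] for _ in range(40)]
predictions = [
    true_intercept + sum(c * w for c, w in zip(true_coef, ws)) for ws in weights
]

coef, intercept = fit_local_surrogate(weights, predictions)
print("coefficients:", coef)
print("intercept:", intercept)
```

In this sketch a large coefficient means the corresponding stated factor moves the prediction strongly in the sampled neighborhood, while a coefficient near zero suggests the factor is stated but has little effect on the output there.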

Related benchmarks

Task                              Dataset  Metric       Result  Rank
Predicting Language Model Output  IMDB     Brier Score  0.0089  28
Predicting Language Model Output  PH       Brier Score  0.0073  28
Predicting Language Model Output  HateBS   Brier Score  0.0043  28
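
For readers unfamiliar with the metric in the table, the Brier score is the mean squared difference between forecast probabilities and observed outcomes, so lower is better and 0 is a perfect score. The sketch below uses the standard binary-outcome definition. How the benchmark pairs forecasts with outcomes is not stated on this page, so the function name and the example values are illustrative assumptions.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and outcomes.

    `forecasts` are probabilities in [0, 1]; `outcomes` are the observed
    values (0 or 1 in the standard binary case).
    """
    if not forecasts or len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must be non-empty and equal length")
    squared_errors = [(f - o) ** 2 for f, o in zip(forecasts, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Illustrative values only, not taken from the benchmark.
example_forecasts = [0.9, 0.2, 0.7]
example_outcomes = [1, 0, 1]
example_score = brier_score(example_forecasts, example_outcomes)
print(example_score)
```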
