BASIL: Bayesian Assessment of Sycophancy in LLMs
About
Sycophancy (overly agreeable or flattering behavior) poses a fundamental challenge for human-AI collaboration, particularly in high-stakes decision-making domains such as health, law, and education. A central difficulty in studying sycophancy in large language models (LLMs) is disentangling sycophantic belief shifts from rational changes in behavior driven by new evidence or user-provided information. Existing approaches either measure descriptive behavior changes or apply normative evaluations that rely on objective ground truth, limiting their applicability to subjective or uncertain tasks. We introduce a Bayesian probabilistic framework, grounded in behavioral economics and rational decision theory, that explicitly separates sycophancy from rational belief updating. Within this framework, we achieve three objectives: (i) a descriptive metric that measures sycophancy while controlling for rational responses to evidence; (ii) a normative metric that quantifies how sycophancy leads models astray from Bayesian-consistent belief updating; and (iii) the ability to apply both metrics in settings without ground-truth labels. Applying our framework across multiple LLMs and three uncertainty-driven tasks, we find robust evidence of sycophantic belief shifts and show that their impact on rationality depends on whether models systematically over- or under-update their beliefs. Finally, we demonstrate that a post-hoc calibration method and two fine-tuning strategies (SFT and DPO) substantially reduce Bayesian inconsistency, with particularly strong improvements under explicit sycophancy prompting.
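The abstract does not state the framework's equations. The sketch below assumes that the Bayes-consistent reference is ordinary Bayes' rule for a binary hypothesis H, with the evidence E summarized by a likelihood ratio LR = P(E | H) / P(E | not H). In log-odds form this is logit(P(H | E)) = logit(P(H)) + log LR. It compares a model's reported posterior against that reference and labels the result as an over- or under-update. The function names, the log-odds "update ratio", and the tolerance are illustrative assumptions, not the paper's definitions.

```python
import math


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    if not 0.0 < p < 1.0:
        raise ValueError("probability must be strictly between 0 and 1")
    return math.log(p / (1.0 - p))


def sigmoid(x: float) -> float:
    """Inverse of logit: maps log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def bayes_posterior(prior: float, likelihood_ratio: float) -> float:
    """Bayes' rule in log-odds form: logit(posterior) = logit(prior) + log(LR)."""
    if likelihood_ratio <= 0.0:
        raise ValueError("likelihood ratio must be positive")
    return sigmoid(logit(prior) + math.log(likelihood_ratio))


def update_ratio(prior: float, model_posterior: float, likelihood_ratio: float) -> float:
    """Model's log-odds shift divided by the Bayes-consistent log-odds shift.

    Hypothetical helper: > 1 suggests over-updating, 0..1 under-updating,
    and < 0 a move against the evidence.
    """
    bayes_shift = math.log(likelihood_ratio)
    if bayes_shift == 0.0:
        raise ValueError("uninformative evidence (LR == 1): ratio is undefined")
    model_shift = logit(model_posterior) - logit(prior)
    return model_shift / bayes_shift


def classify_update(ratio: float, tol: float = 1e-6) -> str:
    """Label an update ratio; the tolerance is an arbitrary illustrative choice."""
    if ratio < 0.0:
        return "against evidence"
    if abs(ratio - 1.0) <= tol:
        return "Bayes-consistent"
    return "over-update" if ratio > 1.0 else "under-update"
```

For example, with a prior of 0.5 and LR = 3, `bayes_posterior(0.5, 3.0)` gives about 0.75. A model that instead reports 0.9 has an update ratio of about 2 and is labeled an over-update. Working in log-odds keeps repeated updates additive, which is why the ratio is taken there rather than on raw probabilities.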
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Sycophancy Assessment | BASIL 1.0 (All) | Change in Bayesian Error (RMSE) | -0.046 | 32 |
| Sycophancy Assessment | BASIL 1.0 (Under-Update) | Change in Bayesian Error (RMSE) | -0.234 | 32 |
| Sycophancy Assessment | BASIL Over-Update 1.0 | Change in Bayesian Error (RMSE) | 0.061 | 32 |
| Sycophancy Evaluation | Sycophancy Evaluation Dataset | Total Sycophancy Score | 0.341 | 32 |
| Bayesian Assessment of Sycophancy | BASIL Abstract setting 1.0 (test) | -- | -- | 18 |
| Bayesian Assessment of Sycophancy | BASIL Third-p. belief setting 1.0 (test) | -- | -- | 18 |
| Bayesian Assessment of Sycophancy | BASIL User belief setting 1.0 (test) | -- | -- | 18 |
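The listing does not define "Change in Bayesian Error (RMSE)". One plausible reading, assumed here and not confirmed by the source, is this: compute the root-mean-square gap between a model's reported posteriors and the Bayes-consistent posteriors in one condition, then subtract the same quantity in a reference condition. Under that reading, a negative value means the model moved closer to Bayesian-consistent beliefs. The function names and the pairing of conditions below are hypothetical.

```python
import math
from typing import Sequence


def bayesian_rmse(model_posteriors: Sequence[float], bayes_posteriors: Sequence[float]) -> float:
    """Root-mean-square gap between reported and Bayes-consistent posteriors."""
    if len(model_posteriors) != len(bayes_posteriors) or len(model_posteriors) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(m - b) ** 2 for m, b in zip(model_posteriors, bayes_posteriors)]
    return math.sqrt(sum(squared) / len(squared))


def rmse_change(
    reference: Sequence[float],
    condition: Sequence[float],
    bayes_posteriors: Sequence[float],
) -> float:
    """RMSE under `condition` minus RMSE under `reference`.

    Negative values mean `condition` is closer to the Bayes-consistent targets.
    """
    return bayesian_rmse(condition, bayes_posteriors) - bayesian_rmse(reference, bayes_posteriors)
```

In a toy case with Bayes-consistent targets `[0.75, 0.75]`, the reference answers are `[0.55, 0.55]` and the answers under the second condition are `[0.65, 0.65]`. The model under-updates in both conditions but less in the second, so `rmse_change` returns about -0.1.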