| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| Sycophancy Evaluation Dataset | mistral:7b | Total Sycophancy Score | 0.123 | 32 | 1mo ago |
| Sycophancy Evaluation Opinion | Mistral-7B | PSS | 8.14 | 14 | 1mo ago |
| Sycophancy Evaluation Factual | Llama-3 | PSS | 0.1124 | 14 | 1mo ago |
| Beacon benchmark | | A/B Accuracy | 96 | 12 | 15d ago |
| PHIL | Supervised Pinpoint Tuning | Sycophancy Preference | 99.34 | 10 | 3mo ago |
| POLI | Ours Resid | Sycophantic Preference (%) | 92.18 | 10 | 3mo ago |
| NLP | Synthetic Data Intervention | Sycophancy Preference | 49.25 | 10 | 3mo ago |
| Open-Ended Sycophancy | Synthetic Data Intervention | Syc Score | 48.15 | 10 | 3mo ago |
| Syco-Bench | | Pickside Score | 1.21 | 10 | 3mo ago |
| Sycophancy Evaluation | | BRR | 13.3 | 9 | 12d ago |
| VISE | Gemini-1.5-Pro | Strong Bias | 58.04 | 9 | 1mo ago |
| SycophancyEval | Lag-DPO | Sycophancy Rate | 54.2 | 9 | 2mo ago |
| DebateQA | | S (PD, L) | 0.481 | 6 | 1mo ago |
| AITA | | Sycophancy Score (S) PD-L | 0.54 | 6 | 1mo ago |
| VISE 1.0 (test) | | Strong Bias | 64.84 | 3 | 1mo ago |
| TruthfulQA (adversarial) | Silicon Mirror | Sycophantic Response Count | 1 | 3 | 2mo ago |
| TruthfulQA Adversarial n=50 | Gemini 2.5 Flash (Static Guardrails) | Sycophantic Responses Count | 2 | 3 | 2mo ago |
| Offline Evaluation Set | gpt-5-thinking | Sycophancy Prevalence Score | 4 | 3 | 3mo ago |
| Early A/B tests Online prevalence | gpt-5-main | Prevalence Change (Free Users) | -0.69 | 1 | 3mo ago |