Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

About

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD): even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show that this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: certain sections may contain potentially harmful content that may not be appropriate for all readers.
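The check at the heart of this threat model can be summarized in a few lines. The sketch below is illustrative, not the authors' implementation (see the linked repository for that): `judge`, `Pair`, and `bench_tolerance` are hypothetical placeholders for an LLM judge callable, a labeled preference pair, and the benchmark-validation criterion. It measures whether an edited rubric keeps benchmark agreement with a trusted reference roughly unchanged while shifting target-domain judgments away from that reference, i.e., inducing RIPD.

```python
# Minimal sketch of detecting Rubric-Induced Preference Drift (RIPD).
# All names here are illustrative assumptions: `judge` stands in for any
# LLM judge that takes a natural-language rubric and a preference pair and
# returns "A" or "B"; each pair carries a trusted reference label.
from typing import Callable, Iterable

Pair = dict  # {"prompt": str, "response_a": str, "response_b": str, "reference": "A" or "B"}
Judge = Callable[[str, str, str, str], str]  # (rubric, prompt, response_a, response_b) -> "A"/"B"

def agreement(judge: Judge, rubric: str, pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the judge's verdict matches the trusted reference."""
    pairs = list(pairs)
    hits = sum(
        judge(rubric, p["prompt"], p["response_a"], p["response_b"]) == p["reference"]
        for p in pairs
    )
    return hits / len(pairs)

def check_rubric_edit(judge: Judge, base_rubric: str, edited_rubric: str,
                      bench_pairs: list, target_pairs: list,
                      bench_tolerance: float = 0.01) -> dict:
    """An edit is 'benchmark-compliant' if benchmark agreement stays within
    `bench_tolerance` of the original rubric; drift is the drop in
    target-domain agreement with the trusted reference."""
    bench_base = agreement(judge, base_rubric, bench_pairs)
    bench_edit = agreement(judge, edited_rubric, bench_pairs)
    target_base = agreement(judge, base_rubric, target_pairs)
    target_edit = agreement(judge, edited_rubric, target_pairs)
    return {
        "benchmark_compliant": (bench_base - bench_edit) <= bench_tolerance,
        "target_drift": target_base - target_edit,  # positive => judgments drifted away
        "bench_accuracy": (bench_base, bench_edit),
        "target_accuracy": (target_base, target_edit),
    }
```

A rubric edit that reports `benchmark_compliant=True` together with a large positive `target_drift` is exactly the stealthy failure mode described above: aggregate benchmark metrics look unchanged while target-domain judgments have shifted.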

Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng • 2026

Related benchmarks

Task | Dataset | Result | Rank
Preference Classification | Anthropic HH Harmless (test) | Accuracy 70.7 | 22
Harmlessness preference labeling accuracy | SafeRLHF-RMB (test) | Bench Accuracy 70.6 | 15
Helpfulness preference labeling accuracy | Ultra-Creative | Accuracy (Bench) 73.5 | 15
Helpfulness preference labeling accuracy | Ultra-Real | Benchmark Score 73.2 | 15
Judge Accuracy | RMB-SafeRLHF (Bench) | Accuracy 85.6 | 4
Judge Accuracy | RMB-SafeRLHF (Target) | Accuracy 67.4 | 4
Judge Accuracy | Ultra-Problem (Bench) | Accuracy 73 | 2
Preference Evaluation | Ultra-Real benchmark | Win Rate 43.1 | 2
Preference Evaluation | Ultra-Real (target) | Win Rate 43 | 2
Preference Evaluation | Anthropic-SafeRLHF benchmark | Win Rate 33.7 | 2
Showing 10 of 12 rows

Other info

GitHub: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface
