
DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

About

Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide a theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
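To make the reranking idea concrete, here is a minimal sketch of disagreement-aware candidate selection under an entropic (KL-robust) risk objective, assuming each candidate response comes with multiple reward samples (e.g., from several annotators or reward models). Function names, parameters, and the risk-budget filtering step are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch, not the DARC reference code.
import numpy as np

def entropic_risk(rewards: np.ndarray, beta: float) -> float:
    """KL-robust (entropic) certainty equivalent of reward samples.

    For beta > 0,
        rho_beta(r) = -(1/beta) * log E[exp(-beta * r)]
    is a pessimistic lower bound on the mean reward and, by DRO duality,
    a worst-case expectation over a KL ball around the empirical
    preference distribution.
    """
    scaled = -beta * rewards
    log_mean_exp = np.logaddexp.reduce(scaled) - np.log(len(rewards))
    return -log_mean_exp / beta

def rerank(candidates, reward_samples, beta=1.0, max_risk_premium=None):
    """Rerank candidates by entropic-risk score with an optional risk budget.

    candidates       : list of response strings
    reward_samples   : array (n_candidates, n_preference_samples)
    beta             : risk-aversion level (beta -> 0 recovers the mean)
    max_risk_premium : optional cap on (mean - entropic risk); candidates
                       exceeding the budget are filtered before ranking
    """
    means = reward_samples.mean(axis=1)
    risks = np.array([entropic_risk(r, beta) for r in reward_samples])
    premiums = means - risks  # disagreement-driven penalty per candidate

    keep = np.ones(len(candidates), dtype=bool)
    if max_risk_premium is not None:
        keep = premiums <= max_risk_premium
        if not keep.any():  # if the budget filters everything, fall back to all
            keep[:] = True
    order = np.argsort(-np.where(keep, risks, -np.inf))
    return [candidates[i] for i in order]

# Example: similar mean rewards, very different annotator disagreement.
cands = ["response A", "response B"]
samples = np.array([[6.0, 6.1, 5.9, 6.0, 6.2],   # low disagreement
                    [9.0, 2.0, 8.5, 1.5, 9.5]])  # high disagreement
print(rerank(cands, samples, beta=0.5, max_risk_premium=1.0))
```

With `beta=0.5`, the second candidate's entropic risk drops well below its mean (a large risk premium), so the risk budget filters it and the low-disagreement response is preferred even though the two means are comparable.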

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multi-turn Instruction Following | MT-Bench | – | – | 44 |
| Instruction Following | AlpacaEval 2.0 (Overall) | Reward | 11.58 | 26 |
| Instruction Following | AlpacaEval 2.0 High-Variance (Top 20%) | Reward Score | 11.56 | 26 |
| Multi-turn Instruction Following | MT-Bench High-Variance (Top 20%) | Reward Score | 7.51 | 26 |
| Chatbot Evaluation | MT-Bench High-Disagreement (Top 20%) | Human Score | 8.72 | 13 |
| Chatbot Evaluation | MT-Bench Overall | Human Score | 8.15 | 13 |
| LLM Alignment Evaluation | Qwen2.5-14B-Instruct Overall | Reward (Avg μ) | 6.18 | 6 |
| LLM Alignment Evaluation | Qwen2.5-14B-Instruct High-Variance (Top 20%) | Average Reward (μ) | 5.49 | 6 |
