When Human Preferences Flip: An Instance-Dependent Robust Loss for RLHF
About
Dataset quality plays an important role in large language model (LLM) alignment. When collecting human feedback, however, preference flipping is ubiquitous and corrupts annotations; this calls for alignment algorithms that are robust to potentially flipped pairs. To this end, this paper introduces Flipping-Aware Direct Preference Optimization (FA-DPO), an algorithm tailored to preference flipping from a reinforcement learning from human feedback (RLHF) perspective. We dissect the underlying human intention model and the preference flipping mechanism introduced by external factors as two distinct stages; in the latter, we introduce an instance-dependent flipping probability on top of the Bradley-Terry (BT) model. Further, by leveraging features relevant to preference annotation, we capture uncertainty in judgments and model preference flipping patterns. In practice, we design a simple yet efficient iterative optimization algorithm compatible with the original RLHF and DPO pipelines. In our experiments, we evaluate the proposed method, along with baseline methods, under multiple instance-dependent preference flipping settings.
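The two-stage view described above can be sketched as a loss function. This is a minimal illustration, not the paper's exact objective: it assumes the standard DPO margin (the difference of policy-to-reference log-ratios between chosen and rejected responses) and models the observed label as correct with probability `1 - flip_prob`, so the likelihood becomes a two-component mixture under the BT model. The function names and the scalar `flip_prob` argument are hypothetical; in FA-DPO the flipping probability is instance-dependent, estimated from annotation-relevant features.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_loss(margin, beta=0.1):
    # Standard DPO: negative log-likelihood under the Bradley-Terry model.
    # `margin` is the chosen-minus-rejected difference of log pi/pi_ref.
    return -math.log(sigmoid(beta * margin))

def fa_dpo_loss(margin, flip_prob, beta=0.1):
    # Flipping-aware sketch: the annotated preference is faithful with
    # probability (1 - flip_prob) and flipped with probability flip_prob,
    # giving a mixture likelihood. With flip_prob = 0 this reduces to DPO.
    p = sigmoid(beta * margin)
    return -math.log((1.0 - flip_prob) * p + flip_prob * (1.0 - p))
```

A useful property of the mixture form: on a pair the model strongly disagrees with (a large negative margin, as a mislabeled pair would produce), the loss is bounded by `-log(flip_prob)` rather than growing without limit, so suspected flipped pairs exert less pull on the policy.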
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Discriminative Performance | UltraFeedback 61.1k (test) | Accuracy | 73.05 | 30 |
| Discriminative Performance | HH Golden 42.5k (test) | Accuracy | 99.61 | 30 |
| Generative Performance | UltraFeedback 61.1k (test) | Win Rate | 69.8 | 30 |
| Generative Performance | HH Golden 42.5k (test) | Win Rate | 88.9 | 30 |
| Preference Alignment | UltraFeedback (20% flipping ratio) | Accuracy | 78.8 | 12 |
| Preference Alignment | UltraFeedback (40% flipping ratio) | Accuracy | 78.87 | 12 |