Back to Blackwell: Closing the Loop on Intransitivity in Multi-Objective Preference Fine-Tuning

About

A recurring challenge in preference fine-tuning (PFT) is handling intransitive (i.e., cyclic) preferences. Intransitive preferences often stem from either (i) inconsistent rankings along a single objective or (ii) scalarizing multiple objectives into a single metric. Regardless of their source, the downstream implication of intransitive preferences is the same: there is no well-defined optimal policy, breaking a core assumption of the standard PFT pipeline. In response, we propose a novel game-theoretic solution concept, the Maximum Entropy Blackwell Winner (MaxEntBW), that is well-defined under multi-objective intransitive preferences. To enable computing MaxEntBWs at scale, we derive PROSPER, a provably efficient PFT algorithm. Unlike prior self-play techniques, PROSPER directly handles multiple objectives without requiring scalarization. We then apply PROSPER to the problem of fine-tuning large language models (LLMs) from multi-objective LLM-as-a-Judge feedback (e.g., rubric-based judges), a setting where both sources of intransitivity arise. We find that PROSPER outperforms all baselines considered on both instruction-following and general chat benchmarks, and we release trained model checkpoints at the 7B and 3B parameter scales.
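To make source (ii) concrete, here is a minimal sketch, not taken from the paper: the response names, objective names, and scores below are invented for illustration. Each objective ranks the three responses perfectly transitively on its own, yet a judge that aggregates the objectives by majority vote becomes cyclic:

```python
from itertools import combinations

# Three candidate responses scored on three objectives (hypothetical values).
# Per objective, the ranking is transitive:
#   helpfulness: A > B > C, harmlessness: B > C > A, conciseness: C > A > B.
scores = {
    "A": {"helpfulness": 3, "harmlessness": 1, "conciseness": 2},
    "B": {"helpfulness": 2, "harmlessness": 3, "conciseness": 1},
    "C": {"helpfulness": 1, "harmlessness": 2, "conciseness": 3},
}

def prefers(x, y):
    # Aggregate judge: x is preferred to y if x wins on a majority of objectives.
    wins = sum(scores[x][o] > scores[y][o] for o in scores[x])
    return wins > len(scores[x]) / 2

for x, y in combinations(scores, 2):
    winner, loser = (x, y) if prefers(x, y) else (y, x)
    print(f"{winner} beats {loser}")
# Output: A beats B, C beats A, B beats C -- the cycle A > B > C > A,
# so no single response is "best" and an argmax-style PFT objective is ill-posed.
```

Because no response beats all others, game-theoretic solution concepts resolve the cycle by returning a distribution over responses rather than a single winner. As a rough analogy only, the sketch below computes the maximin (Nash) strategy of the scalar preference game induced by the cycle above, using the standard matrix-game linear program. The paper's MaxEntBW is instead defined for vector-valued, per-objective preferences (in the spirit of Blackwell's work) with maximum-entropy regularization; this toy collapses everything to a scalar game purely for illustration and is not the paper's construction:

```python
import numpy as np
from scipy.optimize import linprog

# Rows/columns: A, B, C. W[i, j] = 1 if row i beats column j, 0.5 on the diagonal.
W = np.array([[0.5, 1.0, 0.0],
              [0.0, 0.5, 1.0],
              [1.0, 0.0, 0.5]])
G = W - 0.5  # antisymmetric zero-sum payoff matrix

# Maximin LP: maximize v subject to (G^T p)_j >= v for all j, sum(p) = 1, p >= 0.
n = G.shape[0]
c = np.zeros(n + 1)
c[-1] = -1.0                                   # minimize -v
A_ub = np.hstack([-G.T, np.ones((n, 1))])      # v - (G^T p)_j <= 0
b_ub = np.zeros(n)
A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))])
b_eq = np.array([1.0])
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * n + [(None, None)])
print(res.x[:n])  # ~[1/3, 1/3, 1/3]: play the cycle uniformly at random
```

For this rock-paper-scissors-like cycle, the maximin strategy is uniform over the three responses, which is exactly the kind of stochastic "winner" that no single deterministic best response can express.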

Jiahao Zhang, Lujing Zhang, Keltin Grimes, Zhuohao Yu, Gokul Swamy, Zhiwei Steven Wu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | AlpacaEval 2.0 | Win Rate | 55.4 | 507 |
| Instruction Following | AlpacaEval LC 2 | Win Rate | 38.21 | 12 |
| General chat | Arena-Hard Vanilla | Win Rate | 0.492 | 5 |
| General chat | Arena-Hard Style-Controlled | Win Rate | 46.1 | 5 |
