Democratic Preference Alignment via Sortition-Weighted RLHF

About

Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically overrepresent some demographics and underrepresent others. We introduce Democratic Preference Optimization (DemPO), a framework that applies algorithmic sortition, the same mechanism used to construct citizen assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a seventy-five-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from one billion to eight billion parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference collection stage, rather than through post hoc correction, yields models whose behavior better reflects values elicited from representative publics.
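
The relationship between the two schemes can be illustrated with a small numerical sketch. It is not the authors' implementation: the toy rater pool, the per-rater losses, the exact quotas, and the choice of a lottery that is uniform over all quota-satisfying panels are assumptions made for illustration, and the abstract does not say which sortition algorithm or weight normalization DemPO uses. Under those assumptions, weighting each rater's loss by their inclusion probability gives the same value as averaging the summed panel loss over the lottery, which follows from linearity of expectation.

```python
from fractions import Fraction
from itertools import combinations

# Toy rater pool: (rater_id, demographic group, per-rater loss).
# All values are invented for illustration; group A is overrepresented.
POOL = [
    (0, "A", Fraction(3, 10)),
    (1, "A", Fraction(5, 10)),
    (2, "A", Fraction(2, 10)),
    (3, "A", Fraction(9, 10)),
    (4, "B", Fraction(7, 10)),
    (5, "B", Fraction(4, 10)),
]
PANEL_SIZE = 4
QUOTA = {"A": 2, "B": 2}  # hypothetical exact quotas for the mini-public


def feasible_panels(pool, size, quota):
    """Enumerate every panel of the given size whose group counts match the quota."""
    panels = []
    for panel in combinations(pool, size):
        counts = {}
        for _, group, _ in panel:
            counts[group] = counts.get(group, 0) + 1
        if counts == quota:
            panels.append(panel)
    return panels


def inclusion_probabilities(pool, panels):
    """Probability that each rater is selected, assuming a uniform lottery over panels."""
    total = len(panels)
    return {
        rater_id: Fraction(sum(any(r[0] == rater_id for r in p) for p in panels), total)
        for rater_id, _, _ in pool
    }


def expected_hard_panel_loss(panels):
    """Average, over the lottery, of the summed loss of the sampled panel."""
    return sum(sum(loss for _, _, loss in p) for p in panels) / len(panels)


def soft_panel_loss(pool, pi):
    """Loss over the full pool with each rater weighted by inclusion probability."""
    return sum(pi[rater_id] * loss for rater_id, _, loss in pool)


panels = feasible_panels(POOL, PANEL_SIZE, QUOTA)
pi = inclusion_probabilities(POOL, panels)
print("inclusion probabilities:", pi)
print("expected Hard Panel loss:", expected_hard_panel_loss(panels))
print("Soft Panel loss:", soft_panel_loss(POOL, pi))
```

In this toy pool the overrepresented group A receives a lower inclusion probability than group B, so the Soft Panel weights shrink its influence without discarding any data. Exact enumeration only works for very small pools; a realistic setup would need a sampling-based or optimization-based estimate of the inclusion probabilities.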

Suvadip Sana, Jinzhou Wu, Martin T. Wells • 2026

Related benchmarks

Task	Dataset	Result
Preference Alignment Evaluation	PRISM (test)	BT Score (Mean): 0.331	10
Preference Alignment	PRISM 1.0 (full)	Borda Avg Score: 2.459	5
Preference Alignment	15,000 listwise rankings (test)	BT Score: 0.384	5
Preference Alignment	PRISM normalized-step (test)	Borda Avg: 2.328	5
Preference Alignment	PRISM 1.0 (test)	Borda Average: 2.393	5
Consensus Ranking	PRISM Llama-3.2-1B	--	1
Democratic Preference Alignment	Llama 3.2 3B raw scores (test)	--	1
Ranking Consensus	15,000 listwise rankings	--	1
