Democratic Preference Alignment via Sortition-Weighted RLHF

About

Whose values should AI systems learn? Preference-based alignment methods like RLHF derive their training signal from human raters, yet these rater pools are typically convenience samples that systematically over-represent some demographics and under-represent others. We introduce Democratic Preference Optimization, or DemPO, a framework that applies algorithmic sortition, the same mechanism used to construct citizen assemblies, to preference-based fine-tuning. DemPO offers two training schemes. Hard Panel trains exclusively on preferences from a quota-satisfying mini-public sampled via sortition. Soft Panel retains all data but reweights each rater by their inclusion probability under the sortition lottery. We prove that Soft Panel weighting recovers the expected Hard Panel objective in closed form. Using a public preference dataset that pairs human judgments with rater demographics and a seventy-five-clause constitution independently elicited from a representative United States panel, we evaluate Llama models from one billion to eight billion parameters fine-tuned under each scheme. Across six aggregation methods, the Hard Panel consistently ranks first and the Soft Panel consistently outperforms the unweighted baseline, with effect sizes growing as model capacity increases. These results demonstrate that enforcing demographic representativeness at the preference-collection stage, rather than through post hoc correction, yields models whose behavior better reflects values elicited from representative publics.
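The relationship between the two schemes can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a simplified lottery (uniform sampling within each demographic group to fill fixed quotas) in place of the sortition algorithm used in the paper, and it assumes the training objective is a plain sum of per-rater losses. All names (`inclusion_probabilities`, `sample_panel`, the toy `groups`, `quotas`, and `losses`) are hypothetical. Under those assumptions, weighting each rater's loss by their inclusion probability matches the average Hard Panel objective over many lottery draws.

```python
import random
from collections import defaultdict


def inclusion_probabilities(groups, quotas):
    """Inclusion probability per rater under the assumed stratified lottery.

    groups: list of group labels, one per rater.
    quotas: dict mapping group label -> number of panel seats for that group.
    With uniform sampling inside each group, a rater in group g is selected
    with probability quotas[g] / (number of raters in g).
    """
    sizes = defaultdict(int)
    for g in groups:
        sizes[g] += 1
    return [quotas[g] / sizes[g] for g in groups]


def sample_panel(groups, quotas, rng):
    """Draw one quota-satisfying panel and return the selected rater indices."""
    members = defaultdict(list)
    for i, g in enumerate(groups):
        members[g].append(i)
    panel = []
    for g, seats in quotas.items():
        panel.extend(rng.sample(members[g], seats))
    return panel


def soft_panel_objective(losses, probs):
    """All raters kept, each loss weighted by the rater's inclusion probability."""
    return sum(p * loss for p, loss in zip(probs, losses))


def hard_panel_objective(losses, panel):
    """Only the raters on the sampled panel contribute."""
    return sum(losses[i] for i in panel)


# Toy pool (assumed data): group "a" is over-represented, group "b" is not.
groups = ["a"] * 6 + ["b"] * 2
quotas = {"a": 2, "b": 2}
losses = [0.9, 0.4, 0.7, 0.2, 0.5, 0.3, 1.0, 0.6]

probs = inclusion_probabilities(groups, quotas)
soft = soft_panel_objective(losses, probs)

# Monte Carlo estimate of the expected Hard Panel objective.
rng = random.Random(0)
n_draws = 20000
hard_mean = sum(
    hard_panel_objective(losses, sample_panel(groups, quotas, rng))
    for _ in range(n_draws)
) / n_draws

print(f"soft panel objective:        {soft:.3f}")
print(f"mean hard panel objective:   {hard_mean:.3f}")
```

In this toy setting the agreement follows from linearity of expectation: each rater's loss enters the Hard Panel sum exactly when that rater is drawn, which happens with their inclusion probability. A real sortition procedure would generally produce non-uniform inclusion probabilities that have to be computed or estimated from the lottery itself.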

Suvadip Sana, Jinzhou Wu, Martin T. Wells • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Preference Alignment Evaluation | PRISM (test) | BT Score (Mean): 0.331 | 10 |
| Preference Alignment | PRISM 1.0 (full) | Borda Avg Score: 2.459 | 5 |
| Preference Alignment | 15,000 listwise rankings (test) | BT Score: 0.384 | 5 |
| Preference Alignment | PRISM normalized-step (test) | Borda Avg: 2.328 | 5 |
| Preference Alignment | PRISM 1.0 (test) | Borda Average: 2.393 | 5 |
| Consensus Ranking | PRISM Llama-3.2-1B | -- | 1 |
| Democratic Preference Alignment | Llama 3.2 3B raw scores (test) | -- | 1 |
| Ranking Consensus | 15,000 listwise rankings | -- | 1 |