
Pluralistic Leaderboards

About

Recent leaderboard-based evaluations of large language models aggregate user feedback by fitting a Bradley-Terry model to pairwise comparisons, producing a single global ranking based on a latent quality score. While appealing for its simplicity, this approach is incompatible with heterogeneous preferences: when LLMs are used across diverse tasks and use cases, users who favor fundamentally different model behaviors can be systematically misrepresented when collapsed into a single quality score. To address this issue, we study pluralistic leaderboards that aim to remain stable with respect to heterogeneous user populations. Drawing on ideas from social choice theory, we adapt the notion of local stability, which requires that no model outside the top-$k$ positions is collectively preferred to the top-$k$ set by more than an $O(1/k)$ fraction of users. Building on techniques from the social choice literature, we design an alternative leaderboard mechanism that satisfies local stability while eliciting only $\widetilde{O}(k)$ pairwise comparisons per user, where $k$ is the size of the prefix for which stability is guaranteed. Using data from LMArena, we show that standard Bradley-Terry aggregation can violate local stability in practice, whereas our method provides substantially stronger stability guarantees.
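To make the local stability condition concrete, the following is a minimal Python sketch of how a violation could be measured. It is not the mechanism from the paper. It assumes each user supplies a full ranking (best first) and that a user "prefers a model to the top-$k$ set" when they rank that model above every member of the set, which is one common reading in the social choice literature. The function names `max_violation` and `is_locally_stable`, the toy data, and the `threshold` parameter (standing in for the $O(1/k)$ bound, whose constant is not given here) are all hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple


def max_violation(
    rankings: Sequence[Sequence[Hashable]], top_k: Sequence[Hashable]
) -> Tuple[Optional[Hashable], float]:
    """Find the outside model preferred to the whole top-k set by the most users.

    Assumption: a user prefers model m to the set if m appears before every
    member of the set in that user's ranking (rankings are listed best first).
    Returns (model, fraction of users), or (None, 0.0) if there is no such model.
    """
    top = set(top_k)
    counts: Dict[Hashable, int] = {}
    for ranking in rankings:
        for model in ranking:
            if model in top:
                break  # everything after this point is below some top-k member
            counts[model] = counts.get(model, 0) + 1
    if not counts:
        return None, 0.0
    worst = max(counts, key=counts.get)
    return worst, counts[worst] / len(rankings)


def is_locally_stable(
    rankings: Sequence[Sequence[Hashable]],
    top_k: Sequence[Hashable],
    threshold: float,
) -> bool:
    """True if no outside model beats the top-k set for more than `threshold` of users."""
    _, fraction = max_violation(rankings, top_k)
    return fraction <= threshold


# Toy population of five users ranking four hypothetical models.
users: List[List[str]] = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["d", "a", "b", "c"],
    ["d", "b", "a", "c"],
    ["b", "a", "d", "c"],
]

print(max_violation(users, ["a", "b"]))
print(is_locally_stable(users, ["a", "b"], threshold=0.5))
```

In this toy population, two of the five users rank model `d` above both `a` and `b`, so whether the prefix `{a, b}` counts as stable depends entirely on the threshold chosen. A single latent-score ranking can hide exactly this kind of minority that consistently prefers a model left outside the top-$k$.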

Nika Haghtalab, Ariel D. Procaccia, Han Shao, Serena Lutong Wang, Kunhe Yang • 2026

Related benchmarks

Task | Dataset | Result | Rank
LLM Ranking | LMArena-based simulation | Top-8 Ranked Candidates-2.5 | 9
