SOTA LLM Alignment Evaluation benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Value		Updated
AlpacaEval 2	SDPO	LC Win Rate	51.9	89	1mo ago
Arena-Hard	AAO	Win Rate	42.7	73	3mo ago
AlpacaEval 2.0 (test)	OTPO	LC Win Rate	30.35	51	4mo ago
Arena-Hard v0.1	BASE (GPT-4-0314)	Win Rate	50	31	1mo ago
Qwen2.5-14B-Instruct High-Variance (Top 20%)	Base (Best-of-K)	Average Reward (μ)	5.67	6	4mo ago
Qwen2.5-14B-Instruct Overall	Base (Best-of-K)	Reward (Avg μ)	6.31	6	4mo ago
Human Evaluation	Hard-Pair-GRPO	Coherence Score	4.42	5	2mo ago