AlpacaEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Instruction Following | AlpacaEval 2.0 | Win Rate: 95.87 | 722 |
| Instruction Following | AlpacaEval | Win Rate: 98.4 | 420 |
| Instruction Following | AlpacaEval 2 | LC (%): 75.4 | 137 |
| Instruction Following | AlpacaEval 2.0 (test) | LC Win Rate (%): 67.45 | 95 |
| LLM alignment evaluation | AlpacaEval 2 | LC Win Rate: 51.9 | 89 |
| Instruction Following | AlpacaEval (test) | Helpfulness Score: 3,213 | 65 |
| Chat | AlpacaEval 2.0 (test) | AlpacaEval (LC win %): 57.46 | 58 |
| Instruction Following and Helpfulness Evaluation | AlpacaEval 2.0 | Win Rate: 49.4 | 58 |
| LLM Alignment Evaluation | AlpacaEval 2.0 (test) | LC Win Rate: 30.35 | 51 |
| Instruction Following | AlpacaEval LC 2 | Win Rate: 80.9 | 49 |
| Preference Evaluation | AlpacaEval 2 | WR (%): 559 | 48 |
| Open-ended Generation | AlpacaEval 2.0 | Win Rate: 648 | 43 |
| Open-ended | AlpacaEval | Win Rate vs Davinci-003: 93.5 | 40 |
| Chat | AlpacaEval | Win Rate: 3,213 | 39 |
| Pairwise evaluation | AlpacaEval | Human Agreement: 72.4 | 37 |
| Dialogue | AlpacaEval 2 | AlpacaEval2 Score: 64.2 | 34 |
| Instruction Following | AlpacaEval Length-controlled | Score: 73.9 | 34 |
| Predictive LLM Routing | AlpacaEval | Score (vs OpenAI): 63.17 | 26 |
| Instruction following | AlpacaEval High-Variance (Top 20%) 2.0 | Reward Score: 11.6 | 26 |
| Instruction following | AlpacaEval 2.0 (Overall) | Reward: 11.62 | 26 |
| LLM Alignment | AlpacaEval 2.0 | LC Win Rate: 61.52 | 25 |
| General Performance | AlpacaEval | Winrate: 98 | 25 |
| Safety Guardrailing | AlpacaEval | False Positive Rate: 0 | 24 |
| LLM Alignment | AlpacaEval | Win Rate: 25.24 | 24 |
| Chat Evaluation | AlpacaEval LC 2 | Score: 74.11 | 23 |
Showing 25 of 100 rows