
AlpacaEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Instruction Following | AlpacaEval 2.0 | Win Rate: 95.87 | 507 |
| Instruction Following | AlpacaEval | Win Rate: 97.2 | 227 |
| LLM alignment evaluation | AlpacaEval 2 | LC Win Rate: 51.9 | 86 |
| Instruction Following | AlpacaEval 2.0 (test) | LC Win Rate (%): 67.45 | 81 |
| Instruction Following and Helpfulness Evaluation | AlpacaEval 2.0 | Win Rate: 49.4 | 58 |
| LLM Alignment Evaluation | AlpacaEval 2.0 (test) | LC Win Rate: 30.35 | 51 |
| Chat | AlpacaEval 2.0 (test) | AlpacaEval (LC win %): 57.46 | 46 |
| Open-ended Generation | AlpacaEval 2.0 | Win Rate: 648 | 43 |
| Open-ended | AlpacaEval | Win Rate vs Davinci-003: 93.5 | 40 |
| Chat | AlpacaEval | Win Rate: 3,213 | 39 |
| Instruction Following | AlpacaEval (test) | Helpfulness Score: 3,213 | 32 |
| Instruction following | AlpacaEval High-Variance (Top 20%) 2.0 | Reward Score: 11.6 | 26 |
| Instruction following | AlpacaEval 2.0 (Overall) | Reward: 11.62 | 26 |
| General Performance | AlpacaEval | Winrate: 98 | 25 |
| Safety Guardrailing | AlpacaEval | False Positive Rate: 0 | 24 |
| Chat Evaluation | AlpacaEval LC 2 | Score: 74.11 | 23 |
| Open-ended Generation | AlpacaEval 1.0 | Win Rate: 7,904 | 23 |
| Instruction Following | AlpacaEval Yoruba | Win Rate (%): 68.9 | 20 |
| Instruction Following | AlpacaEval Swahili | Win Rate: 83 | 20 |
| Instruction Following | AlpacaEval Indonesian | Win Rate: 64.2 | 20 |
| Instruction Following | AlpacaEval Korean | Win Rate: 77.8 | 20 |
| Instruction Following | AlpacaEval German | Win Rate: 65.2 | 20 |
| Instruction Following | AlpacaEval Chinese | Win Rate: 70.4 | 20 |
| LLM Evaluation | AlpacaEval | AlpacaE: 51.06 | 16 |
| Instruction Following Evaluation | AlpacaEval 2 | Win Rate: 48.14 | 16 |
Showing 25 of 67 rows