Large Language Model Evaluation on AlpacaEval, TruthfulQA, and MMLU (test)
[Chart: AlpacaEval, TruthfulQA, MMLU, and Average scores by method over time; best AlpacaEval score 78.1 by Warmup-Stable-Only (WSO), Mar 17, 2026. Updated 1 month ago.]
Evaluation Results
| Method | Configuration | Date | AlpacaEval Score | TruthfulQA Score | MMLU Score | Average Score |
|---|---|---|---|---|---|---|
| Warmup-Stable-Only (WSO) | Model=1B, Scheduler=Wa... | 2026.03 | 78.1 | 38.7 | 34.5 | 50.4 |
| WSD | Model=1B, Scheduler=WS... | 2026.03 | 77.2 | 38.3 | 33.6 | 49.7 |
| Cosine | Model=1B, Scheduler=Co... | 2026.03 | 76.4 | 37.9 | 33.9 | 49.4 |
| WSD | Model=1B, Scheduler=WS... | 2026.03 | 76.0 | 38.4 | 33.7 | 49.4 |
| Cosine | Model=1B, Scheduler=Co... | 2026.03 | 76.0 | 37.9 | 33.9 | 49.3 |
| Linear | Model=1B, Scheduler=Li... | 2026.03 | 75.6 | 37.8 | 34.2 | 49.2 |
| Linear | Model=1B, Scheduler=Li... | 2026.03 | 75.5 | 37.9 | 33.9 | 49.1 |
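The Average Score column appears to be the unweighted arithmetic mean of the three benchmark scores, rounded to one decimal place. This is an assumption inferred from the listed values, not stated on the page; a minimal sketch reproducing the column:

```python
# Assumption: Average Score = mean of the three benchmark scores,
# rounded to one decimal place (inferred from the table, not confirmed).
rows = [
    ("Warmup-Stable-Only (WSO)", 78.1, 38.7, 34.5),
    ("WSD",    77.2, 38.3, 33.6),
    ("Cosine", 76.4, 37.9, 33.9),
    ("WSD",    76.0, 38.4, 33.7),
    ("Cosine", 76.0, 37.9, 33.9),
    ("Linear", 75.6, 37.8, 34.2),
    ("Linear", 75.5, 37.9, 33.9),
]

def average_score(alpaca: float, truthful: float, mmlu: float) -> float:
    """Unweighted mean of the three benchmark scores, one decimal."""
    return round((alpaca + truthful + mmlu) / 3, 1)

for name, a, t, m in rows:
    print(f"{name}: {average_score(a, t, m)}")
```

Under this assumption the computed means match the table's Average Score column for every row, which is consistent with a simple unweighted average despite the large scale difference between AlpacaEval and the other two benchmarks.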