
OlmoBaseEval

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
| --- | --- | --- | --- |
| Multiple Choice Non-STEM Question Answering | OlmoBaseEval MC Non-STEM (MMLU Humanities/Social Sci, CSQA, PiQA, SocialIQA, CoQA, DROP, Jeopardy, NaturalQs, SQuAD) | Aggregate Score: 89.3 | 34 |
| Code Generation | OlmoBaseEval Code (BigCodeBench, HumanEval, DeepSeek LeetCode, DS 1000, MBPP, MultiPL) | OlmoBaseEval Code Score: 54.9 | 34 |
| Mathematical Reasoning | OlmoBaseEval Math (GSM8k, GSM Symbolic, MATH) | Math Aggregate Score: 68.5 | 34 |
| General Capability Evaluation (Held-out Benchmarks) | OlmoBaseEval LBPP BBH MMLU Pro MC Deepmind Math (HeldOut) | LBPP Score: 42.6 | 24 |
| Multiple Choice STEM Question Answering | OlmoBaseEval MCSTEM | MCSTEM Score: 83.4 | 22 |
| General Language Model Evaluation | OlmoBaseEval HeldOut (LBPP, BBH, MMLU Pro, etc.) | LBPP Score: 33.7 | 10 |