| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Multiple Choice Non-STEM Question Answering | OlmoBaseEval MC Non-STEM (MMLU Humanities/Social Sci, CSQA, PiQA, SocialIQA, CoQA, DROP, Jeopardy, NaturalQs, SQuAD) | Aggregate Score: 89.3 | 34 | |
| Code Generation | OlmoBaseEval Code (BigCodeBench, HumanEval, DeepSeek LeetCode, DS 1000, MBPP, MultiPL) | OlmoBaseEval Code Score: 54.9 | 34 | |
| Mathematical Reasoning | OlmoBaseEval Math (GSM8k, GSM Symbolic, MATH) | Math Aggregate Score: 68.5 | 34 | |
| General Capability Evaluation (Held-out Benchmarks) | OlmoBaseEval HeldOut (LBPP, BBH, MMLU Pro MC, Deepmind Math) | LBPP Score: 42.6 | 24 | |
| Multiple Choice STEM Question Answering | OlmoBaseEval MCSTEM | MCSTEM Score: 83.4 | 22 | |
| General Language Model Evaluation | OlmoBaseEval HeldOut (LBPP, BBH, MMLU Pro, etc.) | LBPP Score: 33.7 | 10 | |