| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| GSM8K, MATH, HumanEval, MBPP, FinanceBench, ConvFinQA, PubMedQA, and MedQA USMLE | Pico | Math Accuracy: 30.65 | 24 | 1mo ago | |
| MMLU, GSM8K, MATH, MedQA, MedMCQA | | Math Accuracy: 56.9 | 15 | 2mo ago | |
| Aggregate HealthBench, LLMMed-Eval, WritingBench, Creative Writing, ResearchQA | EvoRubric | Macro-average Score: 70.55 | 10 | 5d ago | |
| GRAFITE Sample Dataset (Total) | Llama-4-Maverick-17B-128E-Instruct | Pass Rate: 63.2 | 4 | 2mo ago | |