Language Model Evaluation

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
BenchPress short-context (test)	Qwen3-8B	Accuracy	68.84	131	2mo ago
Winogrande, ARC-C, ARC-E, Lambada, PIQA, Hellaswag, MMLU, IFEval, and GSM8K-CoT (Mixed standard 10-shot prompt)	IF4	Accuracy	80.39	88	2mo ago
Open LLM Leaderboard v2 (test)		BBH	60.84	47	2mo ago
ArxivRollBench 2026a		Valid Accuracy	70.8	42	2mo ago
Tulu-3 (dev)		Knowledge Score	60.85	35	23d ago
Qwen3-0.6B Evaluation Suite average	EDGERAZOR	Average Performance	47.8	24	2mo ago
Aggregate		Average Score	69.86	21	1mo ago
Pooled tasks Table 5 Llama-3.1 3.3 (various)	Llama-3.3 70B Instruct	Pooled Accuracy Estimate (γ̂)	57.15	21	5mo ago
AdaptEval	SCALENET (Layer-wise)	ROUGE-Lsum	0.2733	14	5mo ago
LiveBench	CoThinker	Average Score	72.8	12	1mo ago
DCLM Core	Composer (2Mb-M-3A)	DCLM Core Score	49.3	12	2mo ago
lm-eval-harness (test)		MMLU	74.22	9	2mo ago
Downstream Tasks Evaluation Suite Math, Code, Law, Know., Reason., MMLU		Math Accuracy	4.92	9	5mo ago
DATE-LM MMLU, GSM8K, BBH		MMLU Accuracy	62.07	7	1mo ago
Quality, Factuality, and Safety Evaluation Suite (test)	Self-Improving Pretraining	Generation Quality Score	86.3	7	5mo ago
NLP Evaluation Suite (WG, PIQA, BoolQ, ARC-C, ARC-E, OBQA, HS, SciQ, LM, RTE)	QK sharing	WG	60.14	6	5mo ago
Open LeaderBoard v2	AgentSociety	AS	0.6547	5	2mo ago
1.3B LLM Leaderboard	QA+C4-85B	ARC	36.4	5	5mo ago
MMLU-Pro	AgentSociety	AS	77.54	1	2mo ago
