SOTA General Capability benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Updated
MMLU		MMLU Accuracy	79.6	74	2mo ago
Aggregate (GPQA-D, GSM8K, HumanEval, MATH-500, MBPP, MMLU-Pro)	FOREVER	Average Accuracy	75.9	66	2mo ago
MTBench	REPBEND	MTBench Score	9.14	43	5mo ago
OBQA (test)		Normalized Accuracy	44	42	1mo ago
BBH, GSM8K, MMLU, TruthfulQA, HumanEval, MBPP	ADG	Average Score	26.77	30	3mo ago
All Benchmarks Overall	UltraMix	Overall Average Score	52.04	29	3mo ago
8 capability benchmarks Aggregate		Average Capability	67.14	26	5mo ago
GPQA Diamond	In-context distillation	Accuracy	52	14	23d ago
OLMES benchmarks		Average Score	51.4	9	2mo ago
Aggregated Suite 7-metric average (test)	G-Zero	Average Score	43.9	8	2mo ago
Capability Evaluation Suite General Pool	C4 (random)	S_Gen Score	91.61	4	1mo ago
MMLU-Pro OpenR1-Math Harder		Accuracy	71.3	3	5mo ago
GPQA-diamond OpenR1-Math Harder Subset		Accuracy	54	3	5mo ago
ARC-c OpenR1-Math Harder	RePO	Accuracy	70.6	3	5mo ago
