SOTA General Language Model Evaluation benchmarks and papers with code

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Age
Utility Set (MMLU, BBH, TruthfulQA, TriviaQA, AlpacaEval)	ELUDe	MMLU	68.93	34	3mo ago
Aggregated 11-benchmark suite (Math, Code, IF)	Qwen3-30B-A3B	Average Accuracy	74.9	21	2mo ago
Comprehensive Evaluation Suite	CARE-RL	Overall Average Score	50.7	14	1mo ago
Average of 10 tasks	T-SPIN	Overall Performance	45.02	12	4mo ago
OlmoBaseEval HeldOut (LBPP, BBH, MMLU Pro, etc.)	Nemo. 3 Nano	LBPP Score	33.7	10	3mo ago
Arena-Hard V2.0	RM-NLHF	Win Rate	7.03	9	4mo ago
WildBench	PUGC	WildBench Score	26.95	2	4mo ago
