General Language Modeling

Benchmarks

Dataset Name	SOTA Method	Metric	Value	Count	Last Updated
MMLU, ARC-Challenge, and CommonsenseQA Aggregate	RAISE	Average Score	64.77	24	4mo ago
LiveBench	Qwen2.5-7B	Accuracy	31.1	17	1mo ago
CodaSet OOD Average (test)	Qwen3-235B	Performance (%)	87.84	16	2mo ago
BIG-Bench	TALE	Accuracy	85.6	12	2mo ago
General Benchmarks Llama 3.1 8B		Generation Quality Score	66.5	11	4mo ago
Combined Suite (HS, PIQA, SIQA, Wino, MMLU, NQ, TQA, ARC-C, ARC-E, OBQA, BoolQ, DROP, BBH-LB, GSM8K)	MobileMoE-L	Accuracy	57.8	4	1mo ago
Overall Evaluation Suite	Qwen3-30B-A3B-Instruct-2507	Average Score	73.6	4	4mo ago
BIG-Bench (test)	Best Model	Accuracy	83.6	2	2mo ago
Linguistic Task		Comprehensive Score	88.9	2	3mo ago
