Comprehensive Reasoning and Knowledge Benchmark (GSM8K, MMLU, HumanEval, ARC, GPQA)

GSM8K: 95.41

Reasoning

Updated 4mo ago

Evaluation Results

Method	Links	GSM8K	…	…	…	…	…	…	…	…	…
Reasoning 2026.01		95.41	84.4	53.33	89.63	57.25	95.25	99.32	79.84	53.3	78.64
ReasonAny 2026.01		91.13	81	36.67	89.71	53.39	96.88	98.94	82	55.33	76.12
Task Arithmetic 2026.01		89.84	73.6	20	83.54	25.02	95.59	98.77	81.76	36.36	67.16
DARE 2026.01		86.2	81	30	81.32	27.13	96.27	98.59	81.85	29.94	68.03
Safety 2026.01		83.02	82.2	23.33	84.31	23.39	93.48	98.24	81.78	41.67	67.94
FuseLLM 2026.01		81.73	79.8	26.67	71.34	39.85	95.25	98.59	80.8	29.94	67.11
LED 2026.01		69.75	78.4	20	84.31	48.6	95.59	98.41	80.83	52.28	69.8
TIES 2026.01		64.67	71.4	20	88.41	54.23	94.34	96.88	80.65	40.91	67.94
Linear 2026.01		28.28	15.2	10	89.02	24.74	95.93	98.94	81.85	45.18	54.35
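
The page does not label most of the score columns, but the numbers suggest that the last value in each row is the arithmetic mean of the nine scores before it. The sketch below checks that reading; it is an observation from the listed values, not something the page states. The names `ROWS`, `mean_of_leading_scores`, and `last_column_is_mean` are hypothetical and used only for illustration.

```python
# Scores copied row by row from the results table (method -> ten listed values).
# Assumption: the tenth value is an average of the first nine; the page does
# not say so, this script only tests whether the numbers agree with that.
ROWS = {
    "Reasoning": [95.41, 84.4, 53.33, 89.63, 57.25, 95.25, 99.32, 79.84, 53.3, 78.64],
    "ReasonAny": [91.13, 81, 36.67, 89.71, 53.39, 96.88, 98.94, 82, 55.33, 76.12],
    "Task Arithmetic": [89.84, 73.6, 20, 83.54, 25.02, 95.59, 98.77, 81.76, 36.36, 67.16],
    "DARE": [86.2, 81, 30, 81.32, 27.13, 96.27, 98.59, 81.85, 29.94, 68.03],
    "Safety": [83.02, 82.2, 23.33, 84.31, 23.39, 93.48, 98.24, 81.78, 41.67, 67.94],
    "FuseLLM": [81.73, 79.8, 26.67, 71.34, 39.85, 95.25, 98.59, 80.8, 29.94, 67.11],
    "LED": [69.75, 78.4, 20, 84.31, 48.6, 95.59, 98.41, 80.83, 52.28, 69.8],
    "TIES": [64.67, 71.4, 20, 88.41, 54.23, 94.34, 96.88, 80.65, 40.91, 67.94],
    "Linear": [28.28, 15.2, 10, 89.02, 24.74, 95.93, 98.94, 81.85, 45.18, 54.35],
}


def mean_of_leading_scores(scores):
    """Return the mean of every value except the last one in the row."""
    leading = scores[:-1]
    return sum(leading) / len(leading)


def last_column_is_mean(rows, tol=0.01):
    """True if each row's last value matches the mean of the others within tol."""
    return all(
        abs(mean_of_leading_scores(scores) - scores[-1]) < tol
        for scores in rows.values()
    )


if __name__ == "__main__":
    for method, scores in ROWS.items():
        recomputed = mean_of_leading_scores(scores)
        print(f"{method:16s} listed={scores[-1]:6.2f} recomputed={recomputed:6.2f}")
    print("last column consistent with a mean:", last_column_is_mean(ROWS))
```

The tolerance of 0.01 allows for the two-decimal rounding used in the table.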