General Task Performance on Macro-average (Mathematics, Multi-Hop QA, Code Generation)

Accuracy: 69

Claude-Sonnet-4-5

Updated 1mo ago

Evaluation Results

Method	Links	Accuracy
Claude-Sonnet-4-5 2026.01		69	91.4
Qwen3-235B-A22B-Instruct 2026.01		67.5	88.5
GPT-4.1-MINI 2026.01		65.4	76.3
Deepseek-V3.1 2026.01		64.6	87.4
Gemini-2.5-Flash 2026.01		63.9	81.4
DeepSeek-R1-Distill-Qwen-32B 2026.01		61.9	82.3
Grok-4-Fast 2026.01		60.2	76.7
DeepSeek-R1-Distill-Qwen-14B 2026.01		60	79.1
Qwen2.5-72B-Instruct 2026.01		59.9	78.4
QwQ-32B 2026.01		59.1	71.5
Qwen2.5-32B-Instruct 2026.01		57.5	75
Qwen2.5-14B-Instruct 2026.01		56.1	67.9
Qwen3-32B 2026.01		56.1	81.1
Meta-Llama-3.1-70B-Instruct 2026.01		55.9	77.7
GLM-Z1-32B 2026.01		54.8	64.9
Qwen2.5-7B-Instruct 2026.01		50.5	65.4
DeepSeek-R1-Distill-Qwen-7B 2026.01		48.4	75.1
DeepSeek-R1-Distill-Llama-8B 2026.01		47.7	70.3
OpenReasoning-Nemotron-14B 2026.01		45.9	72.7
Meta-Llama-3.1-8B-Instruct 2026.01		40.5	70.8
Qwen2.5-1.5B-Instruct 2026.01		27.4	64.1