Multi-task Language Understanding on MMLU-Pro (Accuracy, Average Score, Improvement Overhead)
[Chart: Accuracy over time. Best result: 72.47 accuracy, Qwen3-235B-A22B, as of Aug 19, 2025. Metrics tracked: Accuracy, Average Score, Improvement Overhead. Updated 1 month ago.]
Evaluation Results

| Method | Framework | Date | Accuracy | Average Score | Improvement Overhead |
| --- | --- | --- | --- | --- | --- |
| Qwen3-235B-A22B | Reference Mo... | 2025.08 | 72.47 | 78.22 | - |
| COCO Qwen3-8B with coco(Llama-3.1-8B) | COCO Framewo... | 2025.08 | 68.69 | 74.37 | 6.5 |
| COCO Qwen3-8B with coco(Qwen3-8B) | COCO Framewo... | 2025.08 | 66.60 | 74.18 | 6.2 |
| Aflow-Qwen3-8B | Multi-Agent... | 2025.08 | 66.56 | 69.86 | - |
| Qwen3-8B | Reference Mo... | 2025.08 | 58.85 | 68.52 | - |
| COCO Llama-3.1-8B with coco(Qwen3-8B) | COCO Framewo... | 2025.08 | 53.44 | 63.59 | 9.5 |
| Llama-3.1-8B | Reference Mo... | 2025.08 | 48.03 | 55.48 | - |
| COCO Llama-3.1-8B with coco(Llama-3.1-8B) | COCO Framewo... | 2025.08 | 45.62 | 58.46 | 0.63 |
| Aflow-Llama3.1-8B | Multi-Agent... | 2025.08 | 45.14 | 58.09 | - |
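The table's accuracy figures can be compared programmatically, for example to measure how much a COCO-augmented model improves over its base model. The snippet below is a minimal sketch, not part of the benchmark itself: the `results` values are copied from the table above, and the `gain` helper is a hypothetical convenience function.

```python
# Accuracy scores copied from the Evaluation Results table above.
results = {
    "Qwen3-235B-A22B": 72.47,
    "COCO Qwen3-8B with coco(Llama-3.1-8B)": 68.69,
    "COCO Qwen3-8B with coco(Qwen3-8B)": 66.60,
    "Aflow-Qwen3-8B": 66.56,
    "Qwen3-8B": 58.85,
    "COCO Llama-3.1-8B with coco(Qwen3-8B)": 53.44,
    "Llama-3.1-8B": 48.03,
    "COCO Llama-3.1-8B with coco(Llama-3.1-8B)": 45.62,
    "Aflow-Llama3.1-8B": 45.14,
}

def gain(method: str, baseline: str) -> float:
    """Accuracy improvement of `method` over `baseline`, in points (hypothetical helper)."""
    return round(results[method] - results[baseline], 2)

# COCO with a Llama-3.1-8B coprocessor lifts Qwen3-8B by 9.84 points.
print(gain("COCO Qwen3-8B with coco(Llama-3.1-8B)", "Qwen3-8B"))  # → 9.84
```

Sorting `results` by value reproduces the table's ranking, with Qwen3-235B-A22B on top.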