Multi-agent Reasoning on Reasoning Benchmarks Cooperative AutoGen framework (test)
Best result: 83.58 Overall Accuracy, MARSHAL (Generalist, 8B), Oct 17, 2025
[Chart: Overall Accuracy; other selectable metrics: MATH, GSM8K, AQUA, AIME, AMC, MMLU, GPQA Accuracy]
Updated 4d ago
Evaluation Results
Method                    Method                     Links    Overall Accuracy  MATH Accuracy  GSM8K Accuracy  AQUA Accuracy  AIME Accuracy  AMC Accuracy  MMLU Accuracy  GPQA Accuracy
MARSHAL (Generalist, 8B)  MAS=AutoGen, Model=MAR...  2025.10  83.58             94.4           95              85.04          70             95            90.04          55.56
Qwen3-8B                  MAS=AutoGen, Model=Qwe...  2025.10  79.68             88.8           95.91           83.07          60             89.19         89.3           51.52
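
The page does not say how Overall Accuracy is computed. As a rough check, the sketch below assumes it is the unweighted mean of the seven per-benchmark accuracies (MATH, GSM8K, AQUA, AIME, AMC, MMLU, GPQA) and compares that against the reported values. The helper name `macro_average` is hypothetical and the averaging rule is an assumption, although both rows above are consistent with it.

```python
# Per-benchmark accuracies copied from the results above, in the order
# MATH, GSM8K, AQUA, AIME, AMC, MMLU, GPQA.
scores = {
    "MARSHAL (Generalist, 8B)": [94.4, 95, 85.04, 70, 95, 90.04, 55.56],
    "Qwen3-8B": [88.8, 95.91, 83.07, 60, 89.19, 89.3, 51.52],
}

# Overall Accuracy as reported on the leaderboard.
reported_overall = {
    "MARSHAL (Generalist, 8B)": 83.58,
    "Qwen3-8B": 79.68,
}


def macro_average(values):
    """Unweighted mean rounded to two decimals (assumed aggregation rule)."""
    return round(sum(values) / len(values), 2)


for method, values in scores.items():
    computed = macro_average(values)
    matches = abs(computed - reported_overall[method]) < 0.005
    print(f"{method}: computed={computed}, reported={reported_overall[method]}, match={matches}")
```

If the assumption holds, each benchmark counts equally toward the overall score regardless of how many questions it contains, so the small AIME and AMC sets carry as much weight as MMLU.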