Multi-task Language Understanding on MMLU-Redux (unseen categories)
[Chart: Accuracy by date. Best result: 72 (AFlow). Latest data point: Feb 23, 2026.]
Updated 4d ago
Evaluation Results
Method      Base LLM      Date      Accuracy
AFlow       GPT-5-Mini    2026.02   72
GDesigner   GPT-5-Mini    2026.02   72
HieraMAS    Qwen3-80B     2026.02   72
SC+CoT      GPT-5-Mini    2026.02   68
AFlow       Qwen3-80B     2026.02   68
GDesigner   Qwen3-80B     2026.02   68
MASRouter   Qwen3-80B     2026.02   68
HieraMAS    GPT-5-Mini    2026.02   68
SC+CoT      Qwen3-80B     2026.02   64
MASRouter   GPT-5-Mini    2026.02   56