Out-of-Domain Reasoning on GPQA
Top result: AFlow, 65.43 Avg@8 Accuracy (Jan 21, 2026). Updated 4d ago.
Evaluation Results
Method          Orchestration Type  Date     Avg@8 Accuracy
AFlow           Inf...              2026.01  65.43
MAS-Orchestra   Tra...              2026.01  65.21
DebateAgent     Sta...              2026.01  64.14
MAS-GPT         Pub...              2026.01  63.51
SCAgent         Sta...              2026.01  62.88
ReflexionAgent  Sta...              2026.01  62.37
CoTAgent        Sta...              2026.01  60.54
MaAS            Inf...              2026.01  40.78
ToolOrchestra   Pub...              2026.01  29.8