
Multiple-choice Reasoning on GPQA full dataset

Accuracy: 66.29

Meta-Debate

[Chart: accuracy by date; y-axis ticks 43.774, 49.6195, 55.465, 61.3105; Jan 23, 2026]
Updated 4d ago

Evaluation Results

Date      Accuracy
2026.01   66.29
2026.01   60.27
2026.01   59.15
2026.01   58.93
2026.01   58.26
2026.01   55.58
2026.01   54.46
2026.01   54.24
2026.01   53.57
2026.01   52.46
2026.01   52.23
2026.01   50.67
2026.01   50.45
2026.01   44.64