
Multiple-choice Reasoning on GPQA full dataset

Accuracy: 66.29

Meta-Debate

[Chart: accuracy over time, Jan 23, 2026 to Mar 23, 2026]
Updated 24d ago

Evaluation Results

Date      Accuracy
2026.01   66.29
2026.01   60.27
2026.01   59.15
2026.01   58.93
2026.01   58.26
2026.01   55.58
2026.01   54.46
2026.01   54.24
2026.01   53.57
2026.01   52.46
2026.01   52.23
2026.01   50.67
2026.01   50.45
2026.01   44.64
2026.03   38.8
2026.03   34.2
2026.03   34.15
2026.03   33.93
2026.03   30.13
2026.03   21.88