Scientific Reasoning on GPQA Diamond (Acc, Tok, CR, Cost)

65.2 Accuracy

Vanilla

[Chart: Accuracy; Apr 6, 2026]
Updated 11d ago

Evaluation Results

Date      Acc    Tok     CR     Cost
2026.04   65.2   8,447   100    8,447
2026.04   65.2   8,098   95.9   8,433
2026.04   58.3   6,355   75.2   6,533
2026.04   56     3,763   44.5   3,799
2026.04   55.8   9,536   100    9,536
2026.04   55.8   9,424   98.8   9,825
2026.04   54.1   6,865   72     10,371
2026.04   53.8   3,128   37     3,159
2026.04   53.2   7,318   100    7,318
2026.04   53.2   7,243   99     7,558
2026.04   53     6,412   67.2   6,496
2026.04   52.8   6,064   63.6   6,144
2026.04   52.8   6,804   93     11,142
2026.04   51.7   5,883   80.4   5,967
2026.04   50.9   5,634   77     5,715
2026.04   49     6,459   67.7   6,670
2026.04   49     6,334   86.6   6,535
2026.04   47.5   9,056   98.7   18,243
2026.04   47.4   9,177   100    9,177
2026.04   47.4   9,178   100    9,527
2026.04   47     8,032   87.5   8,468
2026.04   46.9   7,898   86.1   8,318
2026.04   45.7   7,948   86.6   8,375
2026.04   44.7   830     9.8    1,060
2026.04   39.1   947     9.9    1,210
2026.04   33.9   1,231   16.8   1,656
2026.04   33.3   2,629   28.6   4,394
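
The page does not define the CR column. As a reading aid only, the sketch below assumes CR is a row's token count (Tok) expressed as a percentage of the token count of a baseline row whose CR is 100. That interpretation, the function name `compression_ratio`, and the pairing of rows with a baseline are all assumptions, not something the source states. The sample figures (8,447, 8,098, 830) are taken from the results listed above.

```python
def compression_ratio(tokens: float, baseline_tokens: float) -> float:
    """Return tokens as a percentage of baseline_tokens, rounded to one decimal.

    Assumption: this is what the CR column reports; the source does not say so.
    """
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return round(100.0 * tokens / baseline_tokens, 1)


# Hypothetical pairing: rows compared against the 8,447-token row (CR 100).
baseline = 8447
for tok in (8447, 8098, 830):
    print(tok, compression_ratio(tok, baseline))
# 8447 100.0
# 8098 95.9
# 830 9.8
```

Under this assumption the computed values match the listed CR figures for those rows (100, 95.9, 9.8), but which baseline each row is measured against is not recoverable from the page itself.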