Scientific Reasoning on GPQA Diamond (Accuracy)

62.63 Accuracy (pass@1)

BF16

[Chart: accuracy over time, Sep 28, 2025 to Dec 22, 2025]
Updated 4d ago

Evaluation Results

Method | Links
2025.12 | 62.63 | -
2025.12 | 60.1  | -
2025.12 | 60.1  | -
2025.12 | 59.6  | -
2025.12 | 59.09 | -
2025.12 | 58.59 | -
2025.12 | 58.08 | -
2025.12 | 56.57 | -
2025.12 | 56.06 | -
2025.12 | 55.56 | -
2025.12 | 54.58 | -
2025.12 | 54.54 | -
2025.12 | 52.53 | -
2025.12 | 50    | -
2025.12 | 48.99 | -
2025.12 | 47.47 | -
2025.12 | 46.96 | -
2025.12 | 46.46 | -
2025.12 | 44.95 | -
2025.12 | 42.93 | -
2025.12 | 37.37 | -
2025.09 | 34.8  | -
2025.09 | 31.8  | -
2025.09 | 31.3  | -
2025.12 | 30.81 | -
2025.09 | 30.8  | -
2025.09 | 29.3  | -
2025.09 | 28.8  | -
2025.09 | 28.8  | -
2025.09 | 27.3  | -
2025.09 | 27.3  | -
2025.09 | 27.3  | -
2025.12 | 22.22 | -
2025.12 | 21.72 | -
2026.01 | -     | 55.1
2026.01 | -     | 48.1
2026.01 | -     | 51
2026.01 | -     | 52
        | -     | 75.88
        | -     | 76.78
2025.06 | -     | 45.96
2025.06 | -     | 56.57
2025.06 | -     | 60.61
2026.04 | -     | 14.1
2026.04 | -     | 28.8
2026.04 | -     | 29.3
2026.04 | -     | 27.8
2026.04 | -     | 44.4