
Accuracy Evaluation on BBH General Reasoning

94.6 BBH General Reasoning Accuracy

GPT-5 high

[Chart: BBH General Reasoning accuracy over time, Oct 3, 2025 to Mar 19, 2026]
Updated 10d ago

Evaluation Results

Date      Accuracy   (unlabeled)
2025.11   94.6       -
2025.11   93.8       -
2026.03   92.07      -
2026.03   91.03      -
2025.11   90         -
2026.03   89.66      -
2026.01   88.7       -
2026.01   88.7       -
2026.01   88.5       -
2026.01   88.2       -
2026.02   88         -
2026.02   86         -
2026.02   84         -
2026.02   83.5       -
2026.02   83         -
2026.02   81         -
2026.02   81         -
2025.11   80.6       -
2026.02   79         -
2026.03   78.28      -
2026.03   77.24      -
2026.02   76.5       -
2025.10   75.31      -
2025.10   75.31      -
2025.10   74.43      -
2025.10   73.33      -
2025.10   72.8       -
2026.03   72.76      -
2025.10   72.08      -
2026.02   72         -
2025.11   69.8       -
2026.02   69.5       -
2026.02   69.5       -
2026.02   67.5       -
2026.03   65.17      -
2026.02   65         -
2026.02   65         -
2026.03   63.8       -
2026.03   63.1       -
2025.10   62.42      -
2026.02   61         -
2026.03   60.9       -
2025.10   60.51      -
2025.11   60.11      -
2026.03   59.7       -
2026.03   59.6       -
2026.02   59.5       -
2025.10   58.53      -
2026.02   58         -
2026.02   57.5       -
2026.02   57.5       -
2026.02   56.5       -
2026.02   56.5       -
2026.02   56         -
2026.03   55.9       -
2026.02   55.5       -
2026.02   55         -
2026.02   55         -
2026.03   54.4       -
2026.02   54         -
2026.02   53.5       -
2026.02   53         -
2026.02   52         -
2025.11   51.7       -
2026.02   51         -
2025.10   50.48      -
2025.11   50.34      -
2025.10   49.45      -
2026.03   49.2       -
2026.03   47.24      -
2025.11   46.91      105.7
2026.02   46.5       -
2026.03   45.5       -
2026.03   45.4       -
2026.03   45.3       -
2025.11   45.26      114.1
2025.11   45.13      100.5
2025.11   44.56      118.6
2025.11   44.3       115.1
2026.03   44.14      -
2026.03   43.7       -
2025.11   43.23      116.6
2026.03   42.9       -
2026.03   42.07      -
2025.11   41.28      -
2026.03   40.4       -
2026.03   39.66      -
2026.02   36.5       -
2025.10   35.09      -
2026.03   33.79      -
2025.11   32.27      -
2026.03   30.69      -
2026.03   22.07      -
2025.11   18.15      -
2025.11   17.12      127.2
2025.11   16.27      103.2
2026.03   13.79      -
2025.11   11.3       122.7