General Reasoning on BBH (test)

86.5 Accuracy

Self-Consistency

Updated 22d ago

Evaluation Results

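| Method | Date | Accuracy | (unlabeled) | (unlabeled) | (unlabeled) | (unlabeled) | Links |
|---|---|---|---|---|---|---|---|
| | 2026.05 | 86.5 | - | - | - | - | |
| | 2026.05 | 86.5 | - | - | - | - | |
| | 2026.05 | 85.5 | - | - | - | - | |
| | 2026.05 | 85 | - | - | - | - | |
| | 2026.05 | 84 | - | - | - | - | |
| | 2026.05 | 83 | - | - | - | - | |
| | 2026.05 | 82.5 | - | - | - | - | |
| | 2026.05 | 82.5 | - | - | - | - | |
| | 2025.12 | 81.8 | - | - | - | - | |
| | 2025.12 | 78.7 | - | - | - | - | |
| | 2026.05 | 78.5 | - | - | - | - | |
| | 2026.05 | 77.5 | - | - | - | - | |
| | 2025.12 | 76.6 | - | - | - | - | |
| | 2025.12 | 72.8 | - | - | - | - | |
| | 2025.12 | 72.6 | - | - | - | - | |
| | 2025.12 | 72.3 | - | - | - | - | |
| | 2025.12 | 69.7 | - | - | - | - | |
| | 2025.12 | 68.7 | - | - | - | - | |
| | 2025.12 | 67.5 | - | - | - | - | |
| | 2025.12 | 64.6 | - | - | - | - | |
| | 2025.12 | 64.2 | - | - | - | - | |
| | 2025.12 | 62.6 | - | - | - | - | |
| | 2025.12 | 57.1 | - | - | - | - | |
| | 2025.12 | 56.6 | - | - | - | - | |
| | 2025.12 | 55.3 | - | - | - | - | |
| | 2025.12 | 52.8 | - | - | - | - | |
| | 2025.12 | 50.7 | - | - | - | - | |
| | 2025.12 | 37.6 | - | - | - | - | |
| | 2025.12 | 37.2 | - | - | - | - | |
| | 2025.12 | 27.3 | - | - | - | - | |
| | 2023.06 | - | 60.8 | 64.1 | 56.4 | 45.9 | |
| | 2023.06 | - | 55.6 | 56.5 | 62.1 | 36.7 | |
| | 2023.06 | - | 71.1 | 56.9 | 61.1 | 44.4 | |
| | 2023.06 | - | 68.5 | 67.3 | 32.1 | 45.5 | |
| | 2023.06 | - | 65.5 | 56.9 | 23.5 | 45.6 | |
| | 2023.06 | - | 91.9 | 68.5 | 55.7 | 47.5 | |
| | 2023.06 | - | 64 | 74 | 79.2 | 68.5 | |
| | 2023.06 | - | 88.4 | 75.7 | 79.3 | 62.8 | |
| | 2023.06 | - | 93.3 | 75.5 | 80.9 | 66.4 | |
| | 2023.06 | - | 95.2 | 77.1 | 76.7 | 69.1 | |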