
Exact Match Performance on Big-Bench Hard (BBH)

77.3 Boolean Expressions EM

Base

[Chart: axis ticks 55.044, 60.822, 66.6, 72.378; date Nov 10, 2025]
Updated 1mo ago

Evaluation Results

Method  Links

2025.11  77.3  -  59.8  53.3  50.7  41.3  22.7  5.3   0     38.7
2025.11  77.3  -  51.5  53.2  52    35.2  73.1  41.6  1.9   48.1
2025.11  76.7  -  54    49.3  56.7  20.7  0     32    34    40.4
2025.11  76.7  -  47.1  50    50    24    3.3   26.7  42    40
2025.11  76.7  -  59.8  55.3  50.7  44    76.7  49.3  4.7   52.2
2025.11  74.7  -  62.1  54.7  50.7  44    80    49.3  12    53.3
2025.11  74    -  59.8  56.7  50.7  40.7  76.7  54    12.7  53.6
2025.11  72.7  -  48.3  52.7  54.7  38    8.7   30.7  34.7  42.5
2025.11  71.3  -  48.3  52    50    27.3  1.3   14.7  41.3  38.3
2025.11  68    -  49.4  48.7  52    46.7  0     0     6     33.8
2025.11  68    -  48.3  48.7  51.3  44    0     0     5.3   33.2
2025.11  66.7  -  48.3  48.7  54    44    0     0     7.3   33.6
2025.11  66    -  59.8  53.3  50.7  43.3  78.7  51.3  9.3   52.6
2025.11  66    -  27.6  0.7   46    46    0     0     3.3   23.7
2025.11  66    -  14.9  0     53.3  38.7  0.7   0     1.3   21.9
2025.11  64.5  -  54    42.7  48.5  39.1  8.9   21.6  16.7  37
2025.11  58    -  41.4  25.3  45.3  35.3  0     2     11.3  27.3
2025.11  55.9  -  43.4  31.7  51.7  40.9  26.8  13.9  4.4   33.6