
Reasoning and Math Suite on GSM8K, CommonSense, BoolQ, ARC Challenge, and HellaSwag

Average Accuracy: 87.8

SELF-REDTEAM

[Chart: Average Accuracy; y-axis ticks 66.272, 71.861, 77.45, 83.039]
May 8, 2026
Updated 22d ago

Evaluation Results

Date     Average Accuracy
2026.05  87.8
2026.05  87.4
2026.05  87.4
2026.05  85
2026.05  84.7
2026.05  81.8
2026.05  74.5
2026.05  73.8
2026.05  67.1