Reasoning and Math Suite on GSM8K, CommonSense, BoolQ, ARC Challenge, and HellaSwag
[Chart: Average Accuracy. Top result: SELF-REDTEAM, 87.8 (May 8, 2026). Updated 22d ago.]
Evaluation Results
Method          Backbone        Date     Average Accuracy
SELF-REDTEAM    Qwen2.5-14B-IT  2026.05  87.8
Qwen2.5-14B-IT  Qwen2.5-14B-IT  2026.05  87.4
ABS             Qwen2.5-14B-IT  2026.05  87.4
Qwen2.5-7B-IT   Qwen2.5-7B-IT   2026.05  85
ABS             Qwen2.5-7B-IT   2026.05  84.7
SELF-REDTEAM    Qwen2.5-7B-IT   2026.05  81.8
Qwen2.5-3B-IT   Qwen2.5-3B-IT   2026.05  74.5
ABS             Qwen2.5-3B-IT   2026.05  73.8
SELF-REDTEAM    Qwen2.5-3B-IT   2026.05  67.1