Reasoning on STB26
[Chart: Exact Match (EM) over time. Top result: STAR (Claude Sonnet 4.6), 64.5 EM, May 11, 2026]
Updated 22d ago
Evaluation Results
| Method | Method details | Date | Exact Match (EM) |
|---|---|---|---|
| STAR (Claude Sonnet 4.6) | Framework=STAR, Params... | 2026.05 | 64.5 |
| STAR (GPT-OSS-20B) | Framework=STAR, Params... | 2026.05 | 59.2 |
| STAR (Qwen3-8B) | Framework=STAR, Params... | 2026.05 | 55.3 |
| STAR (Claude Haiku 4.5) | Framework=STAR, Params... | 2026.05 | 55 |
| STAR (GLM-4-9B) | Framework=STAR, Params... | 2026.05 | 48.3 |
| GPT-OSS-20B | Framework=LLM-only, Pa... | 2026.05 | 47.2 |
| STAR (Ministral-3-8B) | Framework=STAR, Params... | 2026.05 | 46.2 |
| Claude Haiku 4.5 | Framework=LLM-only, Pa... | 2026.05 | 45.6 |
| GLM-4-9B | Framework=LLM-only, Pa... | 2026.05 | 45.6 |
| STAR (Llama-3.1-8B) | Framework=STAR, Params... | 2026.05 | 43.9 |
| Ministral-3-8B | Framework=LLM-only, Pa... | 2026.05 | 42 |
| STAR (Llama-3.2-3B) | Framework=STAR, Params... | 2026.05 | 39.1 |
| Llama-3.1-8B | Framework=LLM-only, Pa... | 2026.05 | 38.6 |
| Qwen3-8B | Framework=LLM-only, Pa... | 2026.05 | 37.5 |
| Llama-3.2-3B | Framework=LLM-only, Pa... | 2026.05 | 32 |
| Claude Sonnet 4.6 | Framework=LLM-only, Pa... | 2026.05 | 30.1 |
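Each base model appears twice in the results above, once under Framework=STAR and once under Framework=LLM-only, so the per-model difference can be read off directly. The following is a minimal sketch of that arithmetic, not part of the leaderboard itself: the dictionary `scores` and the helper `star_gains` are hypothetical names introduced here, and the numbers are copied from the Exact Match (EM) column.

```python
# Exact Match (EM) scores copied from the results above,
# stored as (STAR score, LLM-only score) per base model.
scores = {
    "Claude Sonnet 4.6": (64.5, 30.1),
    "GPT-OSS-20B": (59.2, 47.2),
    "Qwen3-8B": (55.3, 37.5),
    "Claude Haiku 4.5": (55.0, 45.6),
    "GLM-4-9B": (48.3, 45.6),
    "Ministral-3-8B": (46.2, 42.0),
    "Llama-3.1-8B": (43.9, 38.6),
    "Llama-3.2-3B": (39.1, 32.0),
}


def star_gains(table):
    """Return (model, gain) pairs sorted by descending gain, in EM points.

    Hypothetical helper: gain = STAR score minus LLM-only score,
    rounded to one decimal to match the precision of the source table.
    """
    gains = [(model, round(star - base, 1)) for model, (star, base) in table.items()]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, gain in star_gains(scores):
        print(f"{model:<20} +{gain:.1f}")
```

Under these numbers the largest gap is for Claude Sonnet 4.6 (64.5 vs 30.1, a difference of 34.4 EM points) and the smallest is for GLM-4-9B (48.3 vs 45.6, a difference of 2.7). The truncated "Params..." fields are not used here, so any configuration differences between the paired rows are not accounted for.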