Reasoning on 10 challenging reasoning tasks Combined
[Chart: Average Score over time. Top result: E3-TIR, 46.7 (Apr 10, 2026).]
Evaluation Results
Method               Details                     Date      Average Score
E3-TIR               Backbone=Qwen2.5-3B, A...   2026.04   46.7
ARPO                 Backbone=Qwen2.5-3B, A...   2026.04   45.2
Tool-Star            Backbone=Qwen2.5-3B, A...   2026.04   44.3
ReCall               Backbone=Qwen2.5-3B, A...   2026.04   40.5
Tree-GRPO            Backbone=Qwen2.5-3B, A...   2026.04   36.1
Search-R1            Backbone=Qwen2.5-3B, A...   2026.04   35.6
ToRL                 Backbone=Qwen2.5-3B, A...   2026.04   32.2
SimpleTIR            Backbone=Qwen2.5-3B, A...   2026.04   31.7
Search-o1            Backbone=Qwen2.5-3B, A...   2026.04   30.7
Qwen2.5-3B-Instruct  Backbone=Qwen2.5-3B         2026.04   26.8