Multi-Turn Tool-Integrated Reasoning (TIR) on AIME25 (Peak avg@32 score)
Top Peak avg@32 Score: 28.13 (OTB), Feb 6, 2026
Updated 1mo ago
Evaluation Results

| Method    | Details                   | Date    | Peak avg@32 Score |
|-----------|---------------------------|---------|-------------------|
| OTB       | Backbone=Qwen2.5-7B, R... | 2026.02 | 28.13             |
| SimpleTIR | Backbone=Qwen2.5-7B, R... | 2026.02 | 26.67             |
| OGB       | Backbone=Qwen2.5-7B, R... | 2026.02 | 21.15             |
| OPO       | Backbone=Qwen2.5-7B, R... | 2026.02 | 20.83             |
| RLOO      | Backbone=Qwen2.5-7B, R... | 2026.02 | 20.1              |
| GRPO      | Backbone=Qwen2.5-7B, R... | 2026.02 | 18.54             |
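
The page does not define the metric, so the following is only a sketch of how a "peak avg@32" score could be computed. It assumes avg@k means the accuracy averaged over k sampled attempts per problem (k = 32 here), reported as a percentage, and that "peak" means the best such score across the evaluated training checkpoints. The function names `avg_at_k` and `peak_avg_at_k` and the input layout are hypothetical and do not come from any of the listed methods.

```python
from statistics import mean


def avg_at_k(correct: list[list[bool]]) -> float:
    """Assumed avg@k: mean accuracy (in percent) over all problems.

    correct[i][j] is True if the j-th sampled attempt on problem i was right.
    Each inner list holds the k attempts for one problem (k = 32 for avg@32).
    """
    per_problem = [
        mean(1.0 if ok else 0.0 for ok in attempts) for attempts in correct
    ]
    return 100.0 * mean(per_problem)


def peak_avg_at_k(checkpoints: dict[int, list[list[bool]]]) -> tuple[int, float]:
    """Assumed "peak": the checkpoint (keyed by training step) with the best avg@k.

    Returns (step, score) for the highest-scoring checkpoint.
    """
    scores = {step: avg_at_k(results) for step, results in checkpoints.items()}
    best_step = max(scores, key=lambda step: scores[step])
    return best_step, scores[best_step]
```

Under these assumptions, a call such as `peak_avg_at_k({100: results_a, 200: results_b})` would return the training step and percentage that a leaderboard row like the ones above reports. Selecting the best checkpoint after the fact makes the number an optimistic estimate, which is worth keeping in mind when comparing it with single-checkpoint results.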