Mathematics on AIME 25 (Avg@32)
Top result: GLM 4.6, 80.2 Avg@32
[Chart: Avg@32 over time, Dec 30, 2025 to May 27, 2026]
Updated 6d ago
Evaluation Results
| Method | Setting | Date | Avg@32 |
|---|---|---|---|
| GLM 4.6 | Evaluation Mode=Chat | 2025.12 | 80.2 |
| LongCat-Flash Exp-Chat | Evaluation Mode=Chat | 2025.12 | 74.9 |
| LongCat-Flash Chat | Evaluation Mode=Chat | 2025.12 | 61.3 |
| DeepSeek V3.2 | Evaluation Mode=Chat | 2025.12 | 56.5 |
| IBTPO | Base Model=Qwen3-8B-Base | 2026.05 | 15.3 |
| TreeRL | Base Model=Qwen3-8B-Base | 2026.05 | 14.9 |
| TreePO | Base Model=Qwen3-8B-Base | 2026.05 | 14.7 |
| IBRO | Base Model=Qwen3-8B-Base | 2026.05 | 14.5 |
| GRPO w/ Entropy Reg† | Base Model=Qwen3-8B-Base | 2026.05 | 14 |
| Vanilla GRPO | Base Model=Qwen3-8B-Base | 2026.05 | 13.6 |
| GRPO w/ Clip-higher | Base Model=Qwen3-8B-Base | 2026.05 | 13.5 |
| Initial Model | Base Model=Qwen3-8B-Base | 2026.05 | 8.9 |
| IBTPO | Base Model=Qwen3-1.7B-... | 2026.05 | 6.7 |
| TreeRL | Base Model=Qwen3-1.7B-... | 2026.05 | 4.6 |
| Vanilla GRPO | Base Model=Qwen3-1.7B-... | 2026.05 | 4.5 |
| GRPO w/ Clip-higher | Base Model=Qwen3-1.7B-... | 2026.05 | 4.5 |
| TreePO | Base Model=Qwen3-1.7B-... | 2026.05 | 4.5 |
| IBRO | Base Model=Qwen3-1.7B-... | 2026.05 | 4.1 |
| GRPO w/ Entropy Reg† | Base Model=Qwen3-1.7B-... | 2026.05 | 2.2 |
| Initial Model | Base Model=Qwen3-1.7B-... | 2026.05 | 1.7 |
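The page does not define Avg@32. Under the common reading of the metric, each problem is attempted with 32 independently sampled responses, the fraction of correct responses is taken per problem, and those fractions are averaged over the benchmark and reported as a percentage. The sketch below illustrates that reading only; the function name `avg_at_k` and the input layout (one list of correctness flags per problem) are assumptions made for illustration, not something taken from this leaderboard or from any of the listed methods.

```python
from statistics import mean


def avg_at_k(correct: list[list[bool]], k: int = 32) -> float:
    """Return the mean per-problem accuracy over k samples, as a percentage.

    `correct[i][j]` is True when the j-th sampled response to problem i
    was graded as correct. This layout is a hypothetical convention.
    """
    if not correct:
        raise ValueError("need at least one problem")
    for i, row in enumerate(correct):
        if len(row) != k:
            raise ValueError(f"problem {i} has {len(row)} samples, expected {k}")
    # Fraction of correct samples for each problem.
    per_problem = [sum(row) / k for row in correct]
    # Average across problems, scaled to a 0-100 score.
    return 100.0 * mean(per_problem)


if __name__ == "__main__":
    # Toy data: one problem always solved, one never solved.
    toy = [[True] * 32, [False] * 32]
    print(avg_at_k(toy))
```

Averaging over many samples per problem reduces the run-to-run variance that a single-sample score would have on a small benchmark, which is the usual motivation for reporting an average over k samples rather than a single pass.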