
Mathematical Reasoning on AIME 2025 (Accuracy, Average output tokens)

36.7 Accuracy

SAPO

Updated 5mo ago

Evaluation Results

Method	Links	Accuracy	Average output tokens
SAPO 2026.02		36.7	-
VESPO 2026.02		34.2	-
VESPO 2026.02		33.5	-
GSPO 2026.02		28.8	-
GRPO 2026.02		28.2	-
GRPO 2026.02		27.4	-
GSPO 2026.02		24.6	-
SAPO 2026.02		21.4	-
Kimi K2.5 2026.02		0.961	25,000
Gemini-3.0 2026.02		0.95	15,000
Kimi K2 2026.02		0.945	30,000
DeepSeek-V3.2 2026.02		0.931	16,000
Qwen3-8B 2025.08		0.7692	19,902.43
Qwen3-8B 2025.08		0.76	11,716.63
SAPO 2026.02		0.7	-
QwQ-32B 2025.08		0.6667	23,711.6
QwQ-32B 2025.08		0.6552	13,434.17
VESPO 2026.02		0.6	-
DeepSeek-R1-14B 2025.08		0.5862	18,000.1
DeepSeek-R1-14B 2025.08		0.5172	10,842.07
GRPO 2026.02		0.5	-
DeepSeek-R1-7B 2025.08		0.3571	18,203.23
DeepSeek-R1-7B 2025.08		0.3571	11,471.17
GSPO 2026.02		0.2	-