Scientific Reasoning on GPQA Diamond (pass@1)
[Chart: pass@1 by date, Oct 2, 2025 to Jan 29, 2026; top result: SPLA, 69.5 pass@1]
Updated 1mo ago
Evaluation Results
| Method | Configuration | Date | pass@1 |
|---|---|---|---|
| SPLA | Model Size=14B, Temper... | 2026.01 | 69.5 |
| SPA | Model Size=14B, Temper... | 2026.01 | 69.2 |
| InfLLM-v2 | Model Size=14B, Temper... | 2026.01 | 68.7 |
| Dense Attention | Model Size=14B, Temper... | 2026.01 | 68.5 |
| NSA | Model Size=14B, Temper... | 2026.01 | 59.6 |
| Continual LUFFY | Backbone=Qwen2.5-Math-... | 2025.10 | 49 |
| On-Policy (Continual) | Backbone=Qwen2.5-Math-... | 2025.10 | 47 |
| ExGRPO (Continual) | Backbone=Qwen2.5-Math-... | 2025.10 | 42.4 |
| GPG-Zero | Backbone=Qwen2.5-Math-... | 2025.10 | 40.4 |
| LUFFY | Backbone=Qwen2.5-Math-... | 2025.10 | 39.9 |
| On-Policy | Backbone=Qwen2.5-Math-... | 2025.10 | 37.4 |
| ExGRPO | Backbone=Qwen2.5-Math-... | 2025.10 | 37.4 |
| Qwen-Instruct | Backbone=Qwen2.5-Math-7B | 2025.10 | 24.7 |
| SFT | Backbone=Qwen2.5-Math-... | 2025.10 | 24.7 |
| RePO-Zero | Backbone=Qwen2.5-Math-... | 2025.10 | 24.2 |
| SFT+RL | Backbone=Qwen2.5-Math-... | 2025.10 | 24.2 |
| Oat-Zero | Backbone=Qwen2.5-Math-... | 2025.10 | 23.7 |
| PRIME-Zero | Backbone=Qwen2.5-Math-... | 2025.10 | 18.2 |
| Qwen-Base | Backbone=Qwen2.5-Math-7B | 2025.10 | 11.1 |