STEM Reasoning on TheoremQA
[Chart: Accuracy by date; top result Bingo-A at 36.8; date shown: Jun 9, 2025. Metric tabs: Accuracy, Length, Logical Accuracy]
Updated 11d ago
Evaluation Results
Method           Base Model      Date     Accuracy  Length  Logical Accuracy
Bingo-A          DeepSeek-R1...  2025.06  36.8      1,648   32.9
Bingo-E          DeepSeek-R1...  2025.06  36.7      1,584   33
DAST             DeepSeek-R1...  2025.06  35.2      2,325   29.8
Demystifying     DeepSeek-R1...  2025.06  35.1      1,976   30.6
Effi. Reasoning  DeepSeek-R1...  2025.06  34.8      3,560   26.2
kimi-k1.5        DeepSeek-R1...  2025.06  34.4      2,136   29.6
Vanilla PPO      DeepSeek-R1...  2025.06  32.3      4,146   22.7
O1-Pruner        DeepSeek-R1...  2025.06  28.6      524     27.6