Planning on Countdown (held-out)
Best Pass@1: 87.96 (INSIGHT), Mar 2, 2026
Evaluation Results
Method               Model                Date     Pass@1
INSIGHT              QWEN3-4B             2026.03  87.96
MoPPS                QWEN3-4B             2026.03  87.8
RANDOM               QWEN3-4B             2026.03  84.9
INVERSE-EVIDENCE     QWEN3-4B             2026.03  83.59
DS                   R1-DISTILL-QWEN-7B   2026.03  83.05
INSIGHT              R1-DISTILL-QWEN-7B   2026.03  81.73
MoPPS                R1-DISTILL-QWEN-7B   2026.03  81.66
EXPECTED-DIFFICULTY  R1-DISTILL-QWEN-7B   2026.03  81.2
RANDOM               R1-DISTILL-QWEN-7B   2026.03  79.02
INVERSE-EVIDENCE     R1-DISTILL-QWEN-7B   2026.03  78.32
INSIGHT              QWEN3-0.6B           2026.03  76.7
MoPPS                QWEN3-0.6B           2026.03  76.28
INVERSE-EVIDENCE     QWEN3-0.6B           2026.03  72.13
RANDOM               QWEN3-0.6B           2026.03  71.57