Domain Reasoning on HL
[Chart: Accuracy by date. Top entry: Best-of-N (N=3), Accuracy 75, Mar 20, 2026]
Updated 25d ago
Evaluation Results
| Method | Configuration | Date | Accuracy |
|---|---|---|---|
| Best-of-N (N=3) | Model=Llama3.1-8B, Ver... | 2026.03 | 75 |
| RM-Regen | Model=Llama3.1-8B, Ver... | 2026.03 | 75 |
| RM-Regen | Model=GPT-3.5, Verific... | 2026.03 | 73 |
| RM-Regen | Model=Gemma2-9B, Verif... | 2026.03 | 73 |
| RM-Regen | Model=Llama3.1-8B | 2026.03 | 73 |
| Reflexion (3 iters) | Model=GPT-3.5, Verific... | 2026.03 | 72 |
| ReflectEvo | Model=Gemma2-9B, Verif... | 2026.03 | 72 |
| Reflexion (3 iters) | Model=Gemma2-9B, Verif... | 2026.03 | 71 |
| RM-Regen | Model=GPT-3.5 | 2026.03 | 71 |
| RM-Regen | Model=Gemma2-9B | 2026.03 | 71 |
| Best-of-N (N=3) | Model=GPT-3.5, Verific... | 2026.03 | 70 |
| Reflexion (3 iters) | Model=Llama3.1-8B, Ver... | 2026.03 | 70 |
| ST CoT | Model=GPT-3.5, iterati... | 2026.03 | 69 |
| ProCo | Model=Gemma2-9B, itera... | 2026.03 | 69 |
| ST CoT | Model=Llama3.1-8B, ite... | 2026.03 | 68 |
| Best-of-N (N=3) | Model=Gemma2-9B, Verif... | 2026.03 | 67 |
| ST CoT | Model=Gemma2-9B, itera... | 2026.03 | 66 |
| ProCo | Model=GPT-3.5, iterati... | 2026.03 | 65 |
| Self-Refine | Model=Llama3.1-8B, ite... | 2026.03 | 65 |
| ReflectEvo | Model=Llama3.1-8B, Ver... | 2026.03 | 62 |
| Self-Refine | Model=Gemma2-9B, itera... | 2026.03 | 59 |
| Self-Refine | Model=GPT-3.5, iterati... | 2026.03 | 53 |
| ProCo | Model=Llama3.1-8B, ite... | 2026.03 | 48 |