Correctness Prediction (AUROC) on GSM8K
[Chart: AUROC over time. Best result: 60.1 AUROC (Direction), Sep 12, 2025. Updated 1mo ago.]
Evaluation Results
| Method | Configuration | Date | AUROC |
|---|---|---|---|
| Direction | Model=Qwen 2.5 7B Inst... | 2025.09 | 60.1 |
| Verb. conf. | Model=Llama 3.3 70B In... | 2025.09 | 59.8 |
| Assessor | Model=Ministral 8B Ins... | 2025.09 | 59.8 |
| Assessor | Model=Qwen 2.5 7B Inst... | 2025.09 | 58.4 |
| Direction | Model=Mistral 7B Instr... | 2025.09 | 57.9 |
| Direction | Model=Ministral 8B Ins... | 2025.09 | 57.8 |
| Verb. conf. | Model=Ministral 8B Ins... | 2025.09 | 57.7 |
| Assessor | Model=DeepSeek R1 Dist... | 2025.09 | 57.6 |
| Assessor | Model=Llama 3.3 70B In... | 2025.09 | 57.3 |
| Assessor | Model=Mistral 7B Instr... | 2025.09 | 55.9 |
| Assessor | Model=Llama 3.1 8B, As... | 2025.09 | 55.8 |
| Direction | Model=DeepSeek R1 Dist... | 2025.09 | 55.2 |
| Verb. conf. | Model=Llama 3.1 8B | 2025.09 | 54 |
| Direction | Model=Llama 3.1 8B, Tr... | 2025.09 | 53.4 |
| Verb. conf. | Model=Mistral 7B Instr... | 2025.09 | 52.5 |
| Verb. conf. | Model=Qwen 2.5 7B Inst... | 2025.09 | 51.3 |
| Verb. conf. | Model=DeepSeek R1 Dist... | 2025.09 | 50.3 |
| Direction | Model=Llama 3.3 70B In... | 2025.09 | 49.9 |
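
For readers unfamiliar with the metric, the AUROC scores reported on this page can be read as the probability that a correctness predictor assigns a higher score to a correct answer than to an incorrect one, with 50 corresponding to chance (assuming the page reports AUROC on a 0 to 100 scale, which the values suggest). The sketch below is a minimal illustration in plain Python, not the benchmark's own evaluation code; the function name `auroc` and the toy data are hypothetical.

```python
def auroc(scores, labels):
    """Pairwise (Mann-Whitney) estimate of AUROC.

    scores: predicted confidence that each answer is correct.
    labels: 1 if the answer was actually correct, 0 otherwise.
    Returns a value in [0, 1]; ties between a correct and an
    incorrect answer count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect answer")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: six answers with a confidence score each.
confidences = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
is_correct = [1, 0, 1, 1, 0, 0]

# Multiply by 100 to compare with the scale assumed for this page.
print(round(100 * auroc(confidences, is_correct), 1))
```

The quadratic pairwise loop is chosen for readability; a rank-based computation gives the same result more efficiently on large evaluation sets.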