Correctness Prediction on Math operations
Best AUROC: 0.913 (Verb. conf.), Sep 12, 2025
Updated 1mo ago
Evaluation Results
Method       Details                    Date     AUROC
Verb. conf.  Model=Llama 3.3 70B In...  2025.09  0.913
Direction    Model=Llama 3.1 8B, Tr...  2025.09  0.858
Direction    Model=DeepSeek R1 Dist...  2025.09  0.847
Direction    Model=Ministral 8B Ins...  2025.09  0.844
Direction    Model=Qwen 2.5 7B Inst...  2025.09  0.837
Direction    Model=Llama 3.3 70B In...  2025.09  0.835
Direction    Model=Mistral 7B Instr...  2025.09  0.782
Verb. conf.  Model=Llama 3.1 8B         2025.09  0.623
Verb. conf.  Model=Mistral 7B Instr...  2025.09  0.617
Assessor     Model=Llama 3.1 8B, As...  2025.09  0.528
Verb. conf.  Model=Qwen 2.5 7B Inst...  2025.09  0.517
Verb. conf.  Model=Ministral 8B Ins...  2025.09  0.5
Verb. conf.  Model=DeepSeek R1 Dist...  2025.09  0.499
Assessor     Model=Mistral 7B Instr...  2025.09  0.493
Assessor     Model=Ministral 8B Ins...  2025.09  0.454
Assessor     Model=Llama 3.3 70B In...  2025.09  0.449
Assessor     Model=Qwen 2.5 7B Inst...  2025.09  0.4
Assessor     Model=DeepSeek R1 Dist...  2025.09  0.337