Correctness Prediction on ProntoQA
Best result: 79.9 AUROC (Llama 3.1)
[Chart: AUROC by date; x-axis label May 8, 2026]
Updated 23d ago
Evaluation Results
| Model | Method | Date | AUROC |
|---|---|---|---|
| Llama 3.1 | Classifier=Logistic Re... | 2026.05 | 79.9 |
| Llama 3.1 | Classifier=Gradient Bo... | 2026.05 | 76.2 |
| DeepSeek R1 | Classifier=Logistic Re... | 2026.05 | 67.2 |
| Qwen 3 | Classifier=Logistic Re... | 2026.05 | 65.7 |
| DeepSeek R1 | Classifier=Gradient Bo... | 2026.05 | 63.9 |
| DeepSeek R1 | Classifier=Self-Certainty | 2026.05 | 61.5 |
| Qwen 3 | Classifier=Self-Certainty | 2026.05 | 61.1 |
| Llama 3.1 | Classifier=Self-Certainty | 2026.05 | 56.6 |
| Qwen 2.5 | Classifier=Self-Certainty | 2026.05 | 56.5 |
| Qwen 3 | Classifier=Gradient Bo... | 2026.05 | 55.1 |
| Llama 3.2 | Classifier=Logistic Re... | 2026.05 | 55 |
| Llama 3.2 | Classifier=Self-Certainty | 2026.05 | 54.9 |
| Llama 3.2 | Classifier=Gradient Bo... | 2026.05 | 53.3 |
| Qwen 2.5 | Classifier=Logistic Re... | 2026.05 | 51.9 |
| Qwen 2.5 | Classifier=Gradient Bo... | 2026.05 | 47.6 |
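
The AUROC values above appear to be reported on a 0 to 100 scale, where 50 corresponds to chance-level ranking. As a rough illustration of what the metric measures for correctness prediction, the sketch below computes AUROC from a list of predictor scores and binary correctness labels using the pairwise definition (the fraction of correct/incorrect pairs in which the correct answer receives the higher score, with ties counted as half). The function name `auroc` and the toy data are hypothetical and not taken from the leaderboard; the page does not state its exact evaluation protocol.

```python
def auroc(scores, labels):
    """Pairwise AUROC: the probability that a randomly chosen correct
    answer (label 1) gets a higher score than a randomly chosen
    incorrect answer (label 0). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one correct and one incorrect example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: predictor scores and whether each answer was correct.
scores = [0.9, 0.8, 0.35, 0.6, 0.2]
labels = [1, 1, 0, 0, 1]

# Scale to 0-100 to match the style of the table above.
print(round(100 * auroc(scores, labels), 1))
```

The quadratic pairwise loop is chosen for readability; for large evaluation sets a rank-based implementation would be the more usual choice.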