| Dataset Name | SOTA Method | Metric | Value | Trend | |
|---|---|---|---|---|---|
| CommonsenseQA | | Power | 0.9999 | 207 | 3mo ago |
| TriviaQA (val) | Llama-3.1-8B | PRR | 87.9 | 175 | 1d ago |
| Stanford Cars | SYNC | Selective Prediction Error | 5.35 | 60 | 3mo ago |
| CIFAR-100 | SYNC | Selective Prediction Error | 0.4 | 60 | 3mo ago |
| ImageNet-100 | SN | Selective Prediction Error | 0.2 | 60 | 3mo ago |
| MedicalQA | SE Probe | E-AURC | 0.3373 | 28 | 2mo ago |
| BioASQ | | E-AURC | 0.2744 | 28 | 2mo ago |
| TriviaQA | | E-AURC | 0.3234 | 28 | 2mo ago |
| TriviaQA (test) | LEC | Power (α=0.1) | 100 | 24 | 3mo ago |
| Classical Minds metacognitive battery (test) | | Baseline Score | 95 | 20 | 1mo ago |
| SAMSum | | PRR | 32.9 | 20 | 1mo ago |
| WMT ru 19 | MeanTokenEntropy | PRR | 33.8 | 20 | 1mo ago |
| WMT de 19 | MeanTokenEntropy | PRR | 46.5 | 20 | 1mo ago |
| WMT fr 14 | MeanTokenEntropy | Prediction Ranking Rate | 39.1 | 20 | 1mo ago |
| WMT de 14 | MeanTokenEntropy | Prediction Ranking Rate | 34.8 | 20 | 1mo ago |
| GSM8K | Evo-Anth-1 | PRR | 90.4 | 20 | 1mo ago |
| TruthfulQA | Evo-Gpt-9 | PRR | 39.4 | 20 | 1mo ago |
| MMLU | RAUQ | PRR | 75.9 | 20 | 1mo ago |
| CoQA | | PRR | 80.6 | 20 | 1mo ago |
| BaBi | RAUQ | PRR | 79.2 | 20 | 1mo ago |
| NCQA (test) | Llama-3.1-8B | PRR | 99.1 | 19 | 1d ago |
| 3 datasets (mean over all 21 runs) | SELFDOUBT | AUROC | 0.7895 | 16 | 1mo ago |
| Classification Datasets Average (test) | Entropy | NAURC | 72.5 | 12 | 8d ago |
| TriviaQA 200 samples (test) | | Rejection Accuracy (80%) | 62.5 | 12 | 2mo ago |
| Diabetic Retinopathy (DR) (test) | Sale_EU_crit | AUSC | 0.65 | 10 | 3mo ago |