| Dataset Name | SOTA Method | Metric | Trend | | |
|---|---|---|---|---|---|
| CommonsenseQA | | Power 0.9999 | 207 | 1mo ago | |
| Stanford Cars | SYNC | Selective Prediction Error 5.35 | 60 | 1mo ago | |
| CIFAR-100 | SYNC | Selective Prediction Error 0.4 | 60 | 1mo ago | |
| ImageNet-100 | SN | Selective Prediction Error 0.2 | 60 | 1mo ago | |
| MedicalQA | SE Probe | E-AURC 0.3373 | 28 | 25d ago | |
| BioASQ | | E-AURC 0.2744 | 28 | 25d ago | |
| TriviaQA | | E-AURC 0.3234 | 28 | 25d ago | |
| TriviaQA (test) | LEC | Power (α=0.1) 100 | 24 | 1mo ago | |
| SAMSum | | PRR 32.9 | 20 | 12d ago | |
| WMT ru 19 | MeanTokenEntropy | PRR 33.8 | 20 | 12d ago | |
| WMT de 19 | MeanTokenEntropy | PRR 46.5 | 20 | 12d ago | |
| WMT fr 14 | MeanTokenEntropy | Prediction Ranking Rate 39.1 | 20 | 12d ago | |
| WMT de 14 | MeanTokenEntropy | Prediction Ranking Rate 34.8 | 20 | 12d ago | |
| GSM8K | Evo-Anth-1 | PRR 90.4 | 20 | 12d ago | |
| TruthfulQA | Evo-Gpt-9 | PRR 39.4 | 20 | 12d ago | |
| MMLU | RAUQ | PRR 75.9 | 20 | 12d ago | |
| CoQA | | PRR 80.6 | 20 | 12d ago | |
| BaBi | RAUQ | PRR 79.2 | 20 | 12d ago | |
| 3 datasets (mean over all 21 runs) | SELFDOUBT | AUROC 0.7895 | 16 | 9d ago | |
| TriviaQA 200 samples (test) | | Rejection Accuracy (80%) 62.5 | 12 | 18d ago | |
| Diabetic Retinopathy (DR) (test) | Sale_EU_crit | AUSC 0.65 | 10 | 1mo ago | |
| Diabetic Retinopathy (DR) grading patient-stratified (test) | Sale_EU_crit | AUSC (Critical FNR) 0.65 | 10 | 1mo ago | |
| NyayaBench v2 | WSR Betting + LTT | Guaranteed Test Coverage (alpha=0.20) 41.1 | 9 | 1mo ago | |
| 3 datasets (Trace) | SELFDOUBT | AUROC 0.7984 | 8 | 9d ago | |
| BBH, GPQA, and MMLU-Pro Pooled (test) | | N0/N Count 1,384 | 8 | 9d ago | |