| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| TRICKME | VANILLA-VERB | Accuracy | 53.88 | 20 | 1mo ago |
| GRACE | VANILLA-VERB | Accuracy | 48.27 | 20 | 1mo ago |
| GUESS | VANILLA-VERB | Accuracy | 0.1858 | 20 | 1mo ago |
| 20Q | VANILLA-VERB | Accuracy | 33.87 | 20 | 1mo ago |
| Open-Set | GPT-4.1 + Verbalized Conf. | Accuracy | 67.2 | 16 | 17d ago |
| Closed-Set tasks | GPT-4.1 + Verbalized Conf. | Accuracy (ACC) | 72 | 16 | 17d ago |
| Polyp OOD | | AURC | 14.1 | 13 | 1mo ago |
| Polyp ID | | AURC | 6.8 | 13 | 1mo ago |
| Optic Cup OOD | | AURC | 0.605 | 13 | 1mo ago |
| Optic Cup ID | | AURC | 0.117 | 13 | 1mo ago |
| MSWML OOD | TLA | AURC | 61.2 | 13 | 1mo ago |
| MSWML ID | | AURC | 41.8 | 13 | 1mo ago |
| Skin Cancer | | AURC | 8.5 | 13 | 1mo ago |
| Breast Cancer | | AURC | 0.343 | 13 | 1mo ago |
| Brain Tumor | | AURC | 0.089 | 13 | 1mo ago |
| VECBench OOD VER | EmoCaliber | ECE | 0.1217 | 13 | 1mo ago |
| VECBench ID VSA | Qwen3-VL | ECE | 0.37 | 13 | 1mo ago |
| VECBench ID VER | EmoCaliber | ECE | 13.63 | 13 | 1mo ago |
| Infeasible Benchmark | Verb | Kaware | 0.961 | 8 | 25d ago |
| Sware | Verb | Kaware | 99 | 8 | 25d ago |
| KUQ | Verb | Kaware | 0.955 | 8 | 25d ago |
| MediTOD | MedConf | AUROC | 68.7 | 7 | 1mo ago |
| DDXPlus | MedConf | AUROC | 0.795 | 7 | 1mo ago |
| VSR (test) | Vision-based confidence estimation framework | AUROC | 67.4 | 6 | 1mo ago |
| MedQA (test) | MedConf | AUROC | 0.69 | 3 | 1mo ago |