| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text Classification | BIOS | Task Accuracy 84.6 | 39 |
| Factuality | BIOS | Factuality 56 | 28 |
| Confidence Estimation (Iterative Tagging) | Bios | Brier Score (BS) 7.5 | 17 |
| Long-form generation factuality and uncertainty estimation | Bios (test) | FA 71.4 | 14 |
| Factual Precision Evaluation | Bios | FACTSCORE 83 | 10 |
| Classification | Bios (test) | Accuracy 80.1 | 7 |
| Attribute-conditional generation | BIOS | Control Accuracy 99.2 | 5 |
| Confidence Estimation (Freeform Tagging) | Bios | Brier Score (BS) 9.2 | 3 |
| Distribution Inference Attack mitigation | Bios sex (M → F) | Adversarial Gap 0.9 | 2 |