| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Long-form generation factuality and uncertainty estimation | WildHallu (test) | Factuality Score: 0.86 | 14 |
| Confidence Estimation (Freeform Tagging) | WildHallu | Brier Score (BS): 4.1 | 11 |
| Confidence Estimation (Iterative Tagging) | WildHallu | Brier Score (BS): 5.7 | 9 |