| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Question Answering | Bio (test) | LLM-Judge Score | 82.9 | 105 |
| Long-form Generation | Bio | LLM-Judge Score | 81 | 45 |
| Factuality Correction | BIO (test) | Precision | 51 | 44 |
| Factuality Correction | BIO dataset | Factual Precision | 93 | 24 |
| Long-form Biography Generation | Bio FactScore | FactScore | 81.2 | 17 |
| Question Answering | Bio poison @ Position 10, k=10 (test) | Robustness Score (LLM-J) | 79.9 | 15 |
| Question Answering | Bio poison @ Position 1, k=10 (test) | Rob. LLM-J Score | 79.3 | 15 |
| Conformal Prediction | bio (test) | Marginal Coverage | 90 | 14 |
| Topic Modeling | Bio | IRBO | 100 | 13 |
| Topic Modeling | Bio | NPMI | 0.191 | 13 |
| Document Clustering | Bio (test) | NMI | 0.557 | 13 |
| Tabular Classification | BIO M (test) | Macro F1 | 80.1 | 9 |
| Factuality Evaluation | BIO (test) | FS Score | 88.9 | 8 |
| AMR Parsing | BIO | Smatch | 62.8 | 8 |
| Retrieval Question Answering | Bio | MRR | 0.15 | 6 |
| Conjunctive Query Answering | Bio queries (test) | AUC | 91 | 6 |
| Conformal Prediction | Bio | Empirical Coverage | 90 | 4 |
| Regression | bio (test) | Max Conditional Coverage Deviation | 4.7 | 4 |