| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Summarization | XSum (test) | ROUGE-2 60.61 | 276 |
| Summarization | Xsum | ROUGE-2 27.1 | 108 |
| Prompt Optimization | XSum | Hypervolume (HV) 0.1626 | 72 |
| Summarization | XSum | PRR 0.617 | 66 |
| Selective Generation | XSum | ROC-AUC 85.9 | 66 |
| AI-generated text detection | XSum Generated by Claude3 (test) | AUROC 100 | 60 |
| AI-generated text detection | XSum Generated by GPT-4 (test) | AUROC 0.9996 | 60 |
| AI-generated text detection | XSum Generated by ChatGPT (test) | AUROC 1 | 60 |
| LGT Detection | XSum Fast-DetectGPT benchmark | AUROC 100 | 54 |
| Summarization | XSum | ROUGE-2 9.16 | 46 |
| Abstractive Summarization | XSUM (test) | ROUGE-L 40.4 | 44 |
| Membership Inference Attack | XSum (test) | AUC 0.945 | 43 |
| Summarization | XSum | ROUGE-L 27.45 | 42 |
| Differential Privacy Auditing | Xsum | Empirical Privacy Loss (epsilon) 0 | 40 |
| Machine-generated text detection | XSum | AUROC 100 | 40 |
| Soft constrained reward optimization | XSum on Gemma-7B | Average Soft Constrained Reward 0.147 | 36 |
| Language Generation | XSum | Accuracy 24.89 | 35 |
| Language Modeling | XSum | Perplexity 14.71 | 26 |
| Prompt injection attack detection | XSum | TPR 100 | 22 |
| Abstractive Summarization | XSum | ROUGE-1 40.67 | 22 |
| Factual Consistency Evaluation | XSum-Faithful (XSF) | Spearman Correlation 47 | 22 |
| Factual Consistency Evaluation | XSumFaith (test) | Pearson Correlation Coefficient 42.5 | 22 |
| Understanding | XSum | Score 46.76 | 20 |
| Masked Language Modeling | XSUM randomly sampled | U-PPL 3.8 | 20 |
| Summarization | XSum | Speedup vs AR 1.84 | 19 |