| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Summarization | XSum (test) | ROUGE-2 | 60.61 | 231 |
| Summarization | Xsum | ROUGE-2 | 27.1 | 108 |
| AI-generated text detection | XSum Generated by Claude3 (test) | AUROC | 100 | 60 |
| AI-generated text detection | XSum Generated by GPT-4 (test) | AUROC | 0.9996 | 60 |
| AI-generated text detection | XSum Generated by ChatGPT (test) | AUROC | 1 | 60 |
| LGT Detection | XSum Fast-DetectGPT benchmark | AUROC | 100 | 54 |
| Abstractive Summarization | XSUM (test) | ROUGE-L | 40.4 | 44 |
| Membership Inference Attack | XSum (test) | AUC | 0.945 | 43 |
| Language Generation | XSum | Accuracy | 24.89 | 35 |
| Language Modeling | XSum | Perplexity | 14.71 | 26 |
| Factual Consistency Evaluation | XSum-Faithful (XSF) | Spearman Correlation | 47 | 22 |
| Factual Consistency Evaluation | XSumFaith (test) | Pearson Correlation Coefficient | 42.5 | 22 |
| Understanding | XSum | Score | 46.76 | 20 |
| Masked Language Modeling | XSUM randomly sampled | U-PPL | 3.8 | 20 |
| LLM-generated text detection | XSum Claude3 Opus | TPR @ FPR 1% | 97.3 | 18 |
| LLM-generated text detection | XSum GPT4 Turbo | TPR @ FPR 1% | 99.3 | 18 |
| LLM-generated text detection | XSum GPT4 | TPR @ FPR 1% | 79.3 | 18 |
| LLM-generated text detection | XSum GPT3.5 Turbo | TPR @ FPR 1% | 96.7 | 18 |
| Abstractive Summarization | XSum | ROUGE-1 | 40.67 | 18 |
| Abstractive Summarization | XSum (val) | ROUGE-2 | 0.1624 | 16 |
| Summarization | XSUM in-domain (test) | D3 Score | 20 | 16 |
| Summarization | XSum | ROUGE-2 | 27.1 | 14 |
| Summarization | XSum | ROUGE-Lsum | 22.47 | 14 |
| Abstractive Summarization | XSum | XSum Score | 17.9 | 14 |
| Summarization | XSum | ROUGE-1 | 19.67 | 12 |