| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Best feasible prompt identification | CNN/DailyMail (test) | Average Soft Constrained Reward | 0.172 | 72 |
| Abstractive Summarization | CNN/DailyMail full length F-1 (test) | ROUGE-1 | 41.69 | 48 |
| Open ended generation | CNN DailyMail | ROUGE-L | 24.3 | 40 |
| Pareto prompt set identification | CNN/DailyMail | Hypervolume (HV) | 18.03 | 36 |
| Language Generation | CNN/DailyMail | Accuracy | 27.16 | 35 |
| Summarization | CNN/DailyMail (test) | ROUGE-L | 48 | 33 |
| Uncertainty Quantification | CNN/DailyMail | Hamming AUC | 0.745 | 28 |
| Summarization | CNN/DailyMail | Hamming Score | -0.276 | 28 |
| Abstractive Summarization | CNN/DailyMail | ROUGE-1 | 44.51 | 25 |
| Summarization | CNN DailyMail | PRR | 0.45 | 22 |
| Summarization | CNN/DailyMail | RougeL | 23.23 | 21 |
| Abstractive Summarization | CNN/DailyMail Summarization | Hamming Distance | 1.597 | 20 |
| Text Summarization | CNN/DailyMail | BA | 97.34 | 16 |
| Reranking | CNN DailyMail | R-1 | 52.43 | 15 |
| Context Compression | CNN/DailyMail | ROUGE-1 | 44.89 | 13 |
| Text Summarization | CNN DailyMail | ROUGE-1 | 38.58 | 13 |
| Long-context Modeling | CNN/DailyMail | Speedup | 3.85 | 12 |
| Binary Classification (Assistive vs Creative) | CNN DailyMail | AUC | 99 | 12 |
| Binary Classification (Human vs Creative) | CNN/DailyMail | AUC | 99 | 12 |
| Binary Classification (Human vs Assistive) | CNN/DailyMail | AUC | 0.99 | 12 |
| Attribution Quality Evaluation | CNN DailyMail | Log-Prob Drop | 1.371 | 12 |
| Short-Generation | CNN/DailyMail | ROUGE-1 | 21.15 | 10 |
| Length-Constrained Text Generation | CNN/DailyMail | Win Rate | 16.43 | 10 |
| Text Generation | CNN/DailyMail (test) | LCTG Error Rate (E) | 3.18 | 10 |
| Text Summarization | CNN/DailyMail (test) | ROUGE-1 | 33.23 | 9 |