| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Reasoning evaluation | DialogSum | Reasoning | 99.1 | 33 |
| Summarization | DIALOGSUM | ROUGE-L | 51.6 | 27 |
| Dialogue Summarization | DialogSum | R-L | 48.36 | 24 |
| Summarization | DialogSum | ROUGE-L | 36.45 | 14 |
| Summarization | DialogSum 1.5k examples (val) | ROUGE-L | 39.1 | 11 |
| Summarization | DIALOGSUM | Std Dev ROUGE-1 | 0.83 | 8 |
| Controllable Summarization | DialogSum | Extent | 20.45 | 7 |
| Dialogue Summarization | DialogSum Single Client | ROUGE-1 | 50.92 | 6 |
| Toxicity Evaluation | DialogSum (DS) | Toxic Fraction | 0 | 5 |
| Counterfactual Fairness | DialogSum (DS) | Sentiment Parity | 0.1 | 5 |
| Stereotyping Evaluation | DialogSum (DS) | Stereotype Fraction | 5.6 | 5 |
| Dialogue Summarization | DialogSum 50 samples (test) | Informativeness | 4.03 | 3 |