| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Emotion Detection | DailyDialog (test) | Micro-F1 0.6034 | 53 |
| Dialogue Emotion Detection | DailyDialog | Micro F1 (- neutral) 0.6167 | 27 |
| Dialogue Generation | DailyDialog | Distinct-1 9.12 | 26 |
| Knowledge Retrieval | DailyDialog | BERTScore Precision (avg) 84.29 | 16 |
| Dialogue Response Selection | DailyDialog reformatted multiple-choice (test) | Accuracy 90.35 | 16 |
| Dialogue Generation | DailyDialog Multi-reference | BLEU-1 38.46 | 16 |
| Response Generation | DailyDialog (test) | BLEU-2 35.4 | 16 |
| Emotion Recognition in Conversation | DailyDialog (test) | F1 Score 0.6312 | 16 |
| Emotion Recognition in Conversations | DailyDialog | Macro F1 59.33 | 15 |
| Attribute-Controlled Dialogue Generation | DailyDialog-CG (test) | Emotion Accuracy (E-ACC) 70.66 | 12 |
| Dialogue Evaluation | DailyDialog (eval) | Spearman Correlation 0.579 | 10 |
| Dialogue | DailyDialog | R-1 14.99 | 10 |
| Knowledge Pair-wise Diversity | DailyDialog (test) | Precision 89.58 | 9 |
| Human Logic Alignment | DailyDialog | Human Logic Alignment (T=0.5) 80.97 | 9 |
| Red Teaming | DailyDialog against DialoGPT-large | RSR 40 | 8 |
| Red Teaming | DailyDialog against BB-3B | RSR 40.2 | 8 |
| Dialogue Policy Evaluation | Dailydialog (test) | USR MLM 81.1 | 8 |
| DialogAct label control | DailyDialog multi-reference (test) | Accuracy 80.25 | 7 |
| Cause Entailment | DailyDialog (DD) (Fold 1) | F1 (Positive) 69.2 | 7 |
| Dialogue Generation | Dailydialog | Attribute Relevancy 34.7 | 6 |
| Response Generation | DailyDialog | Pairwise Diversity 78.5 | 6 |
| Text Generation | DailyDialog (test) | BERTscore 0.8404 | 6 |
| Dialogue Coherence | DailyDialog | QuantiDCE 3.24 | 6 |
| Dialogue Emotion Recognition | DailyDialog | Micro F1 (Neutral) 0.5629 | 6 |
| Dialogue Evaluation | DailyDialog | USR RET 0.998 | 4 |