| Dataset Name | SOTA Method | Metric | Value | Count | Updated |
|---|---|---|---|---|---|
| MT-Bench (test) | LoRA | GPT-4 Score | 8.36 | 46 | 1mo ago |
| MT-Bench | GPT-4o | MT-Bench Score | 9.3 | 29 | 8d ago |
| 4 dialogue tasks (Skill Talk, Empathetic Dialogues, Wizard of Internet, Wizard of Wikipedia) (test) | | F1 Score | 13.7 | 24 | 1mo ago |
| Dialogue | HBAT | PandaLM | 77.79 | 18 | 1mo ago |
| Anthropic-HH (distillation set) | | Response Word Count | 73.53 | 16 | 1mo ago |
| MT-Bench | DFlash+DDTree | Speedup | 4.18 | 12 | 3d ago |
| DailyDialog | GPT2-tree | R-1 | 14.99 | 10 | 1mo ago |
| WoW | MindRef | F1 Score | 14.77 | 8 | 1mo ago |
| USR (N = 198) | TCVA | Spearman's Rho | 0.173 | 7 | 5d ago |
| GROWOVER-DIALOGUE (NEW) | RiLM | BLEU (Month 9) | 5.36 | 6 | 1mo ago |
| GROWOVER-DIALOGUE (UNCHANGED) | RiLM | BLEU (Month 9) | 4.68 | 6 | 1mo ago |
| Dialogue (test) | | Fluency | 8.84 | 5 | 1mo ago |
| Dialogue dataset | M-RAG | BLEU-1 | 24.52 | 4 | 1mo ago |
| TruthfulQA | | Accuracy | 92.2 | 2 | 22d ago |
| MT-Bench (full set) | | Accuracy (%) | 9.3 | 2 | 22d ago |