| Dataset Name | SOTA Method | Metric | Value | Trend | Updated |
|---|---|---|---|---|---|
| MT-Bench (test) | LoRA | GPT-4 Score | 8.36 | 46 | 3mo ago |
| MT-Bench | GPT-4o | MT-Bench Score | 9.3 | 41 | 1mo ago |
| IFEval | FuseChat3.0 | IFEval | 80.2 | 34 | 11d ago |
| AlpacaEval 2 | FuseChat3.0 | AlpacaEval2 Score | 64.2 | 34 | 11d ago |
| 4 dialogue tasks (Skill Talk, Empathetic Dialogues, Wizard of Internet, Wizard of Wikipedia) (test) | | F1 Score | 13.7 | 24 | 3mo ago |
| Dialogue | HBAT | PandaLM | 77.79 | 18 | 3mo ago |
| Anthropic-HH (distillation set) | | Response Word Count | 73.53 | 16 | 3mo ago |
| MT-Bench | DFlash+DDTree | Speedup | 4.18 | 12 | 1mo ago |
| DailyDialog | GPT2-tree | R-1 | 14.99 | 10 | 3mo ago |
| WoW | MindRef | F1 Score | 14.77 | 8 | 3mo ago |
| USR (N = 198) | TCVA | Spearman's Rho | 0.173 | 7 | 1mo ago |
| GROWOVER-DIALOGUE (NEW) | RiLM | BLEU (Month 9) | 5.36 | 6 | 3mo ago |
| GROWOVER-DIALOGUE (UNCHANGED) | RiLM | BLEU (Month 9) | 4.68 | 6 | 3mo ago |
| Dialogue (test) | | Fluency | 8.84 | 5 | 3mo ago |
| WildChat | BACO best | Lexical Coverage | 47.3 | 4 | 17h ago |
| MT-Bench | PARSE + EAGLE3 | TPS (tok/s) | 194 | 4 | 26d ago |
| Dialogue dataset | M-RAG | BLEU-1 | 24.52 | 4 | 3mo ago |
| WildSpeech-Bench | | Score | 76.3 | 3 | 1mo ago |
| SpeechRole | | Score | 124.2 | 3 | 1mo ago |
| URO-Bench-pro | | Understanding Score | 69.1 | 3 | 1mo ago |
| VoiceBench | Qwen3.5-Omni-Plus | Score | 93.1 | 3 | 1mo ago |
| TruthfulQA | | Accuracy | 92.2 | 2 | 2mo ago |
| MT-Bench (full set) | | Accuracy (%) | 9.3 | 2 | 2mo ago |