| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Emotional Text-to-Speech | ESD (English) | SMOS 4.35 | 16 |
| Speech Emotion Recognition | ESD In-Domain v1 (test) | ACC 93.86 | 13 |
| Object Detection | ESD | AP 46.5 | 13 |
| Open-set speaker identification | ESD (test) | EER 0.61 | 12 |
| Text-to-Speech | ESD (test) | MOS 4.47 | 11 |
| Target Speaker Extraction | ESD (test) | SI-SDRi (dB) 16.67 | 8 |
| Empathetic Response Generation | ESD | Emotional Reaction 1.851 | 8 |
| Emotion Style Transfer | ESD (test) | UTMOS 3.93 | 7 |
| Text-to-Speech | ESD | MOS (Happy) 3.87 | 6 |
| Speech Synthesis | ESD Zh | WER 2.4 | 5 |
| Cross-speaker style transfer | ESD (test) | nMOS 3.638 | 5 |
| Emotional Speech Synthesis | ESD English (test) | Score (Neutral) 78.39 | 5 |
| Text-to-Speech | ESD English (test) | WER 6.8 | 5 |
| Speech Emotion Recognition | ESD | UA 98.9 | 5 |
| Instance Segmentation | ESD-1 (test) | Accuracy (2 Objects) 95 | 5 |
| Voice Conversion | ESD | WER 0.149 | 4 |
| Chain Generation | ESD-CoT (test) | B-1 44.87 | 3 |