| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Phonetic Transcription | VCTK++ (test) | F1 Score 93 | 25 |
| Voice Conversion | VCTK | WER 0 | 21 |
| Speech Super-resolution | VCTK 0.92 (test) | LSD 0.7 | 16 |
| Audio Super-resolution | VCTK Multi-speaker (test) | SNR 20 | 15 |
| Audio Super-resolution | VCTK Single-speaker (test) | SNR 19.5 | 15 |
| Audio-to-Text Retrieval | VCTK A→T | Recall@1 96.1 | 15 |
| Pitch Shift | VCTK (10% unseen utterances) | MOS 4.05 | 15 |
| Time-scale modification | VCTK (10% unseen utterances) | MOS 3.98 | 15 |
| Text-to-Speech | VCTK | WER 1.7 | 13 |
| Speech Super-resolution | VCTK 16 kHz target sampling rate 0.92 (test) | LSD 0.78 | 11 |
| Neural Vocoding | VCTK 100 audio clips (unseen) | MAE 0.0925 | 10 |
| Speaker-ID | VCTK (test) | Accuracy 99.3 | 10 |
| Voice Conversion | VCTK (test) | nMOS 4.26 | 9 |
| Speech Synthesis | VCTK (OD) | PESQ 4.5 | 9 |
| Text-to-Speech | VCTK (test) | MOS 4.4 | 8 |
| Neural Vocoding | VCTK (unseen speakers) | MOS 4.37 | 8 |
| Bandwidth Extension | VCTK-BWE BW=2K (test) | WVMOS 4.306 | 7 |
| Speech Separation | VCTK 2 Speech | SI-SDR 14.52 | 7 |
| Audio Super-Resolution | VCTK 4 kHz input sampling rate (test) | WER 1 | 7 |
| Audio Super-Resolution | VCTK 2 kHz input sampling rate (test) | WER 1 | 7 |
| Speech Reconstruction | VCTK subset | PESQ (WB) 2.36 | 7 |
| Dysfluency Detection | VCTK++ | F1 Score 90 | 7 |
| Mel-spectrogram inversion | VCTK (unseen speakers) | MOS 3.79 | 7 |
| Audio Super-Resolution | VCTK (test) | LSD 2.1 | 7 |
| Bandwidth Extension | VCTK-BWE BW=1K (test) | WVMOS 4.154 | 6 |