| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Text-to-audio generation | AudioCaps (test) | KL Divergence 0 | 195 | |
| text-to-audio retrieval | AudioCaps (test) | Recall@1 66.59 | 180 | |
| Audio Captioning | AudioCaps (test) | CIDEr 91.1 | 157 | |
| Audio-to-text Retrieval | AudioCaps (test) | R@1 65.6 | 69 | |
| Audio Captioning | AudioCaps | CIDEr 80.3 | 66 | |
| Text-to-audio retrieval | AudioCaps | Recall@1 55.2 | 57 | |
| Audio Retrieval | AudioCaps | R@1 52 | 56 | |
| Audio Editing | AudioCaps | FD (Frechet Distance) 12.38 | 24 | |
| Cross-modal retrieval | AudioCaps (test) | R@1 59.1 | 23 | |
| Audio-to-Text Retrieval | AudioCaps | R@1 45.1 | 22 | |
| Zero-shot Retrieval (T+V → A) | AudioCaps | Recall@1 95.2 | 14 | |
| Zero-shot Retrieval (T+A → V) | AudioCaps | Recall@1 89 | 14 | |
| Zero-shot Retrieval (T → A+V) | AudioCaps | Recall@1 45.8 | 14 | |
| Text-to-text retrieval | AudioCaps | Recall@1 50.3 | 13 | |
| Audio Editing | AudioCaps | R-MOS 4.43 | 12 | |
| Audio Question Answering | AudioCaps-QA (test) | Model-as-Judge Score 60.77 | 12 | |
| environment-aware text-to-speech | AudioCaps (test) | WER 6.76 | 11 | |
| Audio Question Answering | AudioCaps (test) | Token-Level Accuracy 60.1 | 11 | |
| Audio Understanding | AudioCaps | LB Score 42.82 | 11 | |
| Text-to-audio generation | AudioCaps (evaluation) | FAD 1.85 | 11 | |
| Text-to-Audio | AudioCaps 2019 (test) | FAD 1.558 | 10 | |
| Video-to-Audio Retrieval | AudioCaps V→A | Recall@1 88.3 | 10 | |
| Text-to-Video Retrieval | AudioCaps T→V | Recall@1 20.8 | 10 | |
| Text-to-audio | AudioCaps | FD (OpenL3) 1.86 | 10 | |
| Text-to-Audio Retrieval | AudioCaps 1K 1.0 (test) | Recall@1 52 | 10 | |