| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| text-to-audio retrieval | AudioCaps (test) | Recall@1 66.59 | 145 |
| Audio Captioning | AudioCaps (test) | CIDEr 91.1 | 140 |
| Text-to-audio generation | AudioCaps (test) | FAD 0.77 | 138 |
| Audio-to-text Retrieval | AudioCaps (test) | R@1 65.6 | 62 |
| Audio Captioning | AudioCaps | CIDEr 80.3 | 47 |
| Audio Retrieval | AudioCaps | R@1 52 | 42 |
| Cross-modal retrieval | AudioCaps (test) | R@1 59.1 | 23 |
| Text-to-audio retrieval | AudioCaps | Recall@1 55.2 | 19 |
| Zero-shot Retrieval (T+V → A) | AudioCaps | Recall@1 95.2 | 14 |
| Zero-shot Retrieval (T+A → V) | AudioCaps | Recall@1 89 | 14 |
| Zero-shot Retrieval (T → A+V) | AudioCaps | Recall@1 45.8 | 14 |
| Audio Editing | AudioCaps | R-MOS 4.43 | 12 |
| Audio Question Answering | AudioCaps-QA (test) | Model-as-Judge Score 60.77 | 12 |
| Video-to-Audio Retrieval | AudioCaps V→A | Recall@1 88.3 | 10 |
| Text-to-Video Retrieval | AudioCaps T→V | Recall@1 20.8 | 10 |
| Text-to-Audio Retrieval | AudioCaps 1K 1.0 (test) | Recall@1 52 | 10 |
| Audio Captioning | AudioCaps AudioSet (test) | SPIDEr 48.5 | 10 |
| Automated Audio Captioning | AudioCaps (evaluation) | SPIDEr 51.8 | 9 |
| Neural Audio Compression | AudioCaps (test) | FAD 96.926 | 8 |
| Audio-to-Text Retrieval | AudioCaps 1K 1.0 (test) | R@1 52.4 | 8 |
| Retrieval | AudioCaps T+V -> A | Recall@1 95.2 | 6 |
| Sound Reconstruction | AudioCaps (val) | VISQOL Score 3.2 | 6 |
| Text-to-Audio | AudioCaps multi-event prompts | FDopenl3 75.2 | 5 |
| Text-to-audio infilling | AudioCaps (test) | IS (Inception Score) 13.28 | 5 |
| Audio-Text Retrieval | AudioCaps (val) | mAP 21.98 | 5 |