| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text-to-audio generation | AudioCaps (test) | FAD 0.77 | 154 |
| Text-to-audio retrieval | AudioCaps (test) | Recall@1 66.59 | 152 |
| Audio Captioning | AudioCaps (test) | CIDEr 91.1 | 140 |
| Audio-to-text Retrieval | AudioCaps (test) | R@1 65.6 | 69 |
| Audio Retrieval | AudioCaps | R@1 52 | 50 |
| Audio Captioning | AudioCaps | CIDEr 80.3 | 49 |
| Text-to-audio retrieval | AudioCaps | Recall@1 55.2 | 35 |
| Cross-modal retrieval | AudioCaps (test) | R@1 59.1 | 23 |
| Zero-shot Retrieval (T+V → A) | AudioCaps | Recall@1 95.2 | 14 |
| Zero-shot Retrieval (T+A → V) | AudioCaps | Recall@1 89 | 14 |
| Zero-shot Retrieval (T → A+V) | AudioCaps | Recall@1 45.8 | 14 |
| Audio Editing | AudioCaps | R-MOS 4.43 | 12 |
| Audio Question Answering | AudioCaps-QA (test) | Model-as-Judge Score 60.77 | 12 |
| Text-to-audio generation | AudioCaps (evaluation) | FAD 1.85 | 11 |
| Video-to-Audio Retrieval | AudioCaps V→A | Recall@1 88.3 | 10 |
| Text-to-Video Retrieval | AudioCaps T→V | Recall@1 20.8 | 10 |
| Text-to-Audio Retrieval | AudioCaps 1K 1.0 (test) | Recall@1 52 | 10 |
| Audio Captioning | AudioCaps AudioSet (test) | SPIDEr 48.5 | 10 |
| Automated Audio Captioning | AudioCaps (evaluation) | SPIDEr 51.8 | 9 |
| Audio Steganography | AudioCaps | BER (Original) 0.09 | 8 |
| Neural Audio Compression | AudioCaps (test) | FAD 96.926 | 8 |
| Audio-to-Text Retrieval | AudioCaps 1K 1.0 (test) | R@1 52.4 | 8 |
| Audio-text alignment correlation | AudioCaps (test) | SRCC 0.457 | 7 |
| Retrieval | AudioCaps T+V -> A | Recall@1 95.2 | 6 |
| Sound Reconstruction | AudioCaps (val) | VISQOL Score 3.2 | 6 |