| Dataset Name | SOTA Method | Metric | Trend | Updated | |
|---|---|---|---|---|---|
| AudioCaps (test) | LAMB | CIDEr 91.1 | 140 | 1mo ago | |
| Clotho | | CIDEr 50.9 | 60 | 1mo ago | |
| AudioCaps | EnCLAP-large | CIDEr 80.3 | 49 | 4d ago | |
| Clotho 2.1 (test) | MQ-Cap | CIDEr 0.496 | 31 | 1mo ago | |
| Clotho (test) | | METEOR 19.7 | 21 | 1mo ago | |
| Clotho V2 | AF-Next-Instruct | CIDEr 52 | 11 | 4d ago | |
| AudioCaps AudioSet (test) | HTSAT-BART | SPIDEr 48.5 | 10 | 1mo ago | |
| ASFx (eval) | Whisper-Cards | SPIDEr 19.36 | 9 | 1mo ago | |
| Clotho (eval) | Audio Flamingo 3 | SPIDEr 31.88 | 9 | 1mo ago | |
| ParaSpeechCaps (PSC) | Multi-Task (Ours) | Captioning Score 46.01 | 8 | 20d ago | |
| MusicCaps | PAR (Ours-UTS) | Captioning Score 23.33 | 8 | 20d ago | |
| Clotho V1 | VAST | B@4 18.5 | 8 | 1mo ago | |
| DCASE Task 6 2020 (dev test) | Ensemble | BLEU-1 53.7 | 6 | 1mo ago | |
| Auto-ACD (test) | MQ-Cap | CIDEr 70.4 | 6 | 1mo ago | |
| AudioSet | SoundAtlas | LA-CLAP 0.447 | 4 | 1mo ago | |
| Song Describer (SD) | | SBERT Similarity 0.469 | 4 | 22d ago | |
| MusicCaps (MC) non-vocal | | SBERT Similarity 0.478 | 4 | 1mo ago | |
| Clotho Caption | Wavcaps | CIDEr 48.8 | 4 | 1mo ago | |
| AudioCaps (val) | Ours | CIDEr 64 | 4 | 1mo ago | |
| Clotho Caps (test) | X-InstructBLIP | Score 29.4 | 4 | 1mo ago | |
| AutoACD | MiDashengLM | FENSE Score 66.52 | 3 | 22d ago | |
| MECAT | MiDashengLM | FENSE Content Long Score 60.11 | 3 | 22d ago | |
| MUGEN-GAME (unseen) | MMGPT | BLEU-4 6.7 | 3 | 1mo ago | |
| AudioCaps | SoundAtlas | MWR-S (MLLM) 0.75 | 3 | 1mo ago | |
| VGGSound | SoundAtlas | LA-CLAP 0.461 | 3 | 1mo ago | |