| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 38.6 | 85 |
| Audio Captioning | Clotho | CIDEr | 50.9 | 82 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 28.3 | 78 |
| Audio-to-Text Retrieval | Clotho | R@1 | 26.5 | 49 |
| Audio Captioning | Clotho (test) | METEOR | 19.7 | 43 |
| Audio Retrieval | Clotho | R@1 | 23.7 | 33 |
| Text-to-Audio Retrieval | Clotho | R@1 | 0.212 | 31 |
| Audio Captioning | Clotho 2.1 (test) | CIDEr | 0.496 | 31 |
| Cross-modal retrieval | Clotho (test) | R@1 | 46.4 | 29 |
| Audio Question and Answering | ClothoAQA | Accuracy | 85.6 | 20 |
| Text-to-Audio Generation | Clotho (test) | FID | 17.23 | 17 |
| Text-to-Audio Retrieval | Clotho T→A | Recall@1 | 24 | 15 |
| Text-to-Audio Retrieval | Clotho V1 | R@1 | 25.3 | 15 |
| Audio Hallucination Evaluation | Clotho-1K | HR | 16.98 | 14 |
| Audio Understanding | ClothoAQA | Accuracy | 75.16 | 14 |
| Text-to-text retrieval | Clotho | R@1 | 64.52 | 13 |
| Text-to-Audio Retrieval | Clotho (evaluation) | R@1 | 22.87 | 13 |
| Text-to-audio Retrieval | Clotho V2 (test) | R@1 | 4.61 | 13 |
| Audio-to-text Retrieval | Clotho V2 (test) | Recall@1 | 18.78 | 13 |
| Text-to-Audio Retrieval | Clotho V2 | R@1 (%) | 27.2 | 13 |
| Automated Audio Captioning | Clotho | AAC Score | 55.92 | 12 |
| Automated Audio Captioning | Clotho 2.1 (evaluation) | SPIDEr | 33.4 | 12 |
| Audio Question Answering | Clotho (test) | Token-Level Accuracy | 52.8 | 11 |
| Audio Captioning | Clotho V2 | CIDEr | 52 | 11 |
| Watermark Detection | Clotho 1.0 (test) | Perth | 100 | 10 |