| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Audio-to-Text Retrieval | Clotho (test) | R@1 38.6 | 78 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 28.3 | 62 |
| Audio Captioning | Clotho | CIDEr 50.9 | 60 |
| Audio Captioning | Clotho 2.1 (test) | CIDEr 0.496 | 31 |
| Cross-modal retrieval | Clotho (test) | R@1 46.4 | 29 |
| Audio Captioning | Clotho (test) | METEOR 19.7 | 21 |
| Audio Question and Answering | ClothoAQA | Accuracy 85.6 | 20 |
| Audio Retrieval | Clotho | R@1 23.7 | 20 |
| Text-to-Audio Generation | Clotho (test) | FID 17.23 | 17 |
| Text-to-Audio Retrieval | Clotho T→A | Recall@1 24 | 15 |
| Text-to-Audio Retrieval | Clotho V1 | R@1 25.3 | 15 |
| Text-to-audio Retrieval | Clotho V2 (test) | R@1 4.61 | 13 |
| Audio-to-text Retrieval | Clotho V2 (test) | Recall@1 18.78 | 13 |
| Text-to-Audio Retrieval | Clotho V2 | R@1 (%) 27.2 | 13 |
| Automated Audio Captioning | Clotho 2.1 (evaluation) | SPIDEr 33.4 | 12 |
| Automated Audio Captioning | Clotho (evaluation) | SPIDEr 33.2 | 10 |
| Text-to-Audio Retrieval | Clotho 1K 1.0 (test) | R@1 26.9 | 10 |
| Audio Captioning | Clotho (eval) | SPIDEr 31.88 | 9 |
| Audio Captioning | Clotho V2 | CIDEr 51.9 | 9 |
| Audio-to-Text Retrieval | Clotho 1K 1.0 (test) | R@1 27.1 | 8 |
| Audio Captioning | Clotho V1 | B@4 18.5 | 8 |
| Audio Understanding | ClothoAQA | Accuracy 75.16 | 7 |
| Audio Question Answering | Clotho AQA | Score 85.6 | 7 |
| Audio Understanding | Clotho V2 | CIDEr 25.1 | 6 |
| Audio Classification/Retrieval | Clotho | Score 0.042 | 6 |