| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text-to-Audio Generation | One-to-one evaluation benchmarks Text-to-Audio | FAD 1.41 | 6 |
| Text-to-Image Generation | One-to-one evaluation benchmarks Text-to-Image | FID 17.39 | 6 |
| Audio-to-Text Generation | one-to-one evaluation benchmarks | CLAP Score 45.08 | 5 |
| Audio-to-Text Generation | One-to-one evaluation benchmarks Audio-to-Text | CIDEr 55.11 | 5 |
| Image-to-Text Generation | One-to-one evaluation benchmarks Image-to-Text | CIDEr 134.7 | 5 |
| Audio-to-Image Generation | One-to-one evaluation benchmarks Audio-to-Image | FID 26.6 | 4 |
| Image-to-Audio Generation | One-to-one evaluation benchmarks Image-to-Audio | FAD 2.5 | 4 |