| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Video-to-Audio Generation | VGGSound (test) | FAD | 0.52 | 83 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 | 60.2 | 39 |
| Audio-visual Zero-Shot Classification | VGGSound GZSL (test) | S Score | 29.96 | 38 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 | 77.6 | 37 |
| Audio-visual Classification | VGGSound | Top-1 Acc | 69.8 | 37 |
| Video Classification | VGGSound-C unimodal (test) | Accuracy (Gaussian) | 53.14 | 25 |
| Classification | VGGSound-C (test) | Error Rate (Gauss.) | 6.2 | 24 |
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc | 69.1 | 23 |
| Video-to-audio generation | VGGSound | FD_VGG | 0.97 | 22 |
| Multimodal Event Classification | VGGSound-C severity level 5 (test) | Gauss. Corruption Accuracy | 54.9 | 20 |
| Video-to-Audio | VGGSound (test) | FD (PaSST) | 47.38 | 20 |
| Multimodal Retrieval | VGGSound-S (test) | Recall@1 (Video -> Text) | 6.8 | 19 |
| Video Retrieval | VGGSound | R@1 | 33.5 | 15 |
| Video Classification | VGGSound-C severity level 5 | Accuracy (Gaussian Blur) | 54.7 | 14 |
| Zero-shot Classification (A+V → T) | VGGSound | Zero-shot Accuracy | 52.7 | 14 |
| Audio-visual Recognition | VGGSound GZSL | S Score | 48.33 | 14 |
| Task-wise classification accuracy | VGGSound-2C bimodal (test) | Accuracy (Gaussian) | 43.74 | 14 |
| Audio-to-Video Retrieval | VGGSound (test) | Recall@1 | 34.9 | 13 |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1 | 89.6 | 13 |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3 | 69.5 | 13 |
| Audio-visual classification | VGGSound Music | Top-1 Accuracy | 71.57 | 12 |
| Audio-visual localization | VGGSound (Unheard 110 categories) | cIoU | 48.4 | 11 |
| Audio-visual localization | VGGSound (Heard 110 categories) | cIoU | 54.85 | 11 |
| Video-to-Audio Retrieval | VGGSound (test) | Recall@1 | 33.5 | 11 |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence | 1.35 | 10 |