| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Video-to-Audio Generation | VGGSound (test) | FAD | 0.52 | 95 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 | 60.2 | 39 |
| Audio-visual Zero-Shot Classification | VGGSound GZSL (test) | S Score | 29.96 | 38 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 | 77.6 | 37 |
| Audio-visual Classification | VGGSound | Top-1 Acc | 69.8 | 37 |
| Video Classification | VGGSound-C unimodal (test) | Accuracy (Gaussian) | 53.14 | 25 |
| Classification | VGGSound-C (test) | Error Rate (Gauss.) | 6.2 | 24 |
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc | 69.1 | 23 |
| Video-to-audio generation | VGGSound | FD_VGG | 0.97 | 22 |
| Multimodal Event Classification | VGGSound-C severity level 5 (test) | Gauss. Corruption Accuracy | 54.9 | 20 |
| Video-to-Audio | VGGSound (test) | FD (PaSST) | 47.38 | 20 |
| Multimodal Retrieval | VGGSound-S (test) | Recall@1 (Video -> Text) | 6.8 | 19 |
| Event Classification (A → V) | VGGSound-AVEL 90K | Precision | 67 | 15 |
| Event Classification (V → A) | VGGSound-AVEL 40K | Precision | 75.3 | 15 |
| Video Retrieval | VGGSound | R@1 | 33.5 | 15 |
| Audio-Visual Captioning | VGGSound Animal | Cs Score | 51.52 | 14 |
| Video Classification | VGGSound-C severity level 5 | Accuracy (Gaussian Blur) | 54.7 | 14 |
| Zero-shot Classification (A+V → T) | VGGSound | Zero-shot Accuracy | 52.7 | 14 |
| Audio-visual Recognition | VGGSound GZSL | S Score | 48.33 | 14 |
| Task-wise classification accuracy | VGGSound-2C bimodal (test) | Accuracy (Gaussian) | 43.74 | 14 |
| Audio-to-Video Retrieval | VGGSound (test) | Recall@1 | 34.9 | 13 |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1 | 89.6 | 13 |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3 | 69.5 | 13 |
| Audio-visual classification | VGGSound Music | Top-1 Accuracy | 71.57 | 12 |
| Event Localization (A → V) | VGGSound AVEL 90K | Segment-level Accuracy | 70.4 | 11 |