| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Video-to-Audio Generation | VGGSound (test) | FAD: 0.75 | 62 | |
| Audio-visual Zero-Shot Classification | VGGSound GZSL (test) | S Score: 29.96 | 38 | |
| Video Classification | VGGSound-C unimodal (test) | Accuracy (Gaussian): 53.14 | 25 | |
| Classification | VGGSound-C (test) | Error Rate (Gauss.): 6.2 | 24 | |
| Audio-visual Classification | VGGSound | Top-1 Acc: 69.8 | 24 | |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5: 53.7 | 23 | |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3: 46.9 | 23 | |
| Multimodal Event Classification | VGGSound-C severity level 5 (test) | Gauss. Corruption Accuracy: 54.9 | 20 | |
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc: 65.8 | 18 | |
| Video Retrieval | VGGSound | R@1: 33.5 | 15 | |
| Zero-shot Classification (A+V → T) | VGGSound | Zero-shot Accuracy: 52.7 | 14 | |
| Audio-visual Recognition | VGGSound GZSL | S Score: 48.33 | 14 | |
| Task-wise classification accuracy | VGGSound-2C bimodal (test) | Accuracy (Gaussian): 43.74 | 14 | |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1: 89.6 | 13 | |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3: 69.5 | 13 | |
| Audio-visual classification | VGGSound Music | Top-1 Accuracy: 71.57 | 12 | |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence: 1.35 | 10 | |
| Cross-modal Generation | VGGSound | Average Score: 87.23 | 9 | |
| Video-to-Audio | VGGSound (test) | APCC-Δ: 0.758 | 9 | |
| Sound source localization | VGGSound Source | cIoU: 40.6 | 9 | |
| Sound Localization | VGGSound Single 1.0 (test) | IoU@0.5: 40.8 | 9 | |
| Sound Localization | VGGSound-Instruments 1.0 (test) | IoU@0.3: 55.3 | 9 | |
| Multi-source sound localization | VGGSound-Duet | CIoU@0.3: 26.2 | 9 | |
| Multi-source sound localization | VGGSound Instruments | CIoU@0.1: 77.5 | 9 | |
| Zero-shot Classification (A → T) | VGGSound | Accuracy: 47.1 | 8 | |