
VGGSound

Benchmarks

| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Video-to-Audio Generation | VGGSound (test) | FAD 0.52 | 83 |
| Single-source sound localization | VGGSound single-source (test) | IoU@0.5 60.2 | 39 |
| Audio-visual Zero-Shot Classification | VGGSound GZSL (test) | S Score 29.96 | 38 |
| Multi-sound source localization | VGGSound-Duet (test) | CIoU@0.3 77.6 | 37 |
| Audio-visual Classification | VGGSound | Top-1 Acc 69.8 | 37 |
| Video Classification | VGGSound-C unimodal (test) | Accuracy (Gaussian) 53.14 | 25 |
| Classification | VGGSound-C (test) | Error Rate (Gauss.) 6.2 | 24 |
| Audio-Visual Event Classification | VGGSound (test) | Fusion Top-1 Acc 69.1 | 23 |
| Video-to-audio generation | VGGSound | FD_VGG 0.97 | 22 |
| Multimodal Event Classification | VGGSound-C severity level 5 (test) | Gauss. Corruption Accuracy 54.9 | 20 |
| Video-to-Audio | VGGSound (test) | FD (PaSST) 47.38 | 20 |
| Multimodal Retrieval | VGGSound-S (test) | Recall@1 (Video -> Text) 6.8 | 19 |
| Video Retrieval | VGGSound | R@1 33.5 | 15 |
| Video Classification | VGGSound-C severity level 5 | Accuracy (Gaussian Blur) 54.7 | 14 |
| Zero-shot Classification (A+V → T) | VGGSound | Zero-shot Accuracy 52.7 | 14 |
| Audio-visual Recognition | VGGSound GZSL | S Score 48.33 | 14 |
| Task-wise classification accuracy | VGGSound-2C bimodal (test) | Accuracy (Gaussian) 43.74 | 14 |
| Audio-to-Video Retrieval | VGGSound (test) | Recall@1 34.9 | 13 |
| Multi-source sound localization | VGGSound Instruments (test) | CIoU@0.1 89.6 | 13 |
| Single-source sound localization | VGGSound Instruments (test) | IoU@0.3 69.5 | 13 |
| Audio-visual classification | VGGSound Music | Top-1 Accuracy 71.57 | 12 |
| Audio-visual localization | VGGSound (Unheard 110 categories) | cIoU 48.4 | 11 |
| Audio-visual localization | VGGSound (Heard 110 categories) | cIoU 54.85 | 11 |
| Video-to-Audio Retrieval | VGGSound (test) | Recall@1 33.5 | 11 |
| Text-to-Audio | VGGSound-Omni (test) | KL Divergence 1.35 | 10 |
Showing 25 of 68 rows