| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Audio-visual event localization | AVE (test) | Accuracy | 83.5 | 37 |
| Audio-Visual Event Localization | AVE | Accuracy | 81.1 | 35 |
| Audio-visual event recognition | AVE (test) | AV Accuracy | 71.64 | 20 |
| Multimodal Classification | AVE (test) | Multi Acc | 65.1 | 14 |
| Multimodal Classification | AVE | Accuracy (%) | 73.82 | 12 |
| Video Saliency Prediction | AVE (test) | AUC-J | 88.53 | 7 |
| Continual audio-visual sound separation | AVE | SDR | 3.55 | 6 |
| Direction Prediction | AVE (test) | Accuracy (10-class) | 38.9 | 6 |
| Audio-Visual Classification | AVE | AV Score | 71.64 | 6 |
| Emergent modality binding (vi -> te -> au) | AVE (test) | mAP | 18.1 | 5 |
| Emergent modality binding (au -> te -> vi) | AVE (test) | mAP | 0.168 | 5 |
| Image-to-Audio Retrieval | AVE | mAP | 4.13 | 4 |
| Audio-to-Image Retrieval | AVE | mAP | 4.11 | 4 |
| Audio localization from visual segment query | AVE | V2A | 35.8 | 4 |
| Audio-Visual Event Classification | AVE | Accuracy | 0.934 | 4 |
| Audio-Image Retrieval | AVE (test) | mAP | 4.46 | 4 |
| Text-to-Audio Retrieval | AVE | Accuracy | 28.7 | 3 |
| Audio-to-Text Retrieval | AVE | Accuracy | 33.1 | 3 |
| Supervised Event Localization | AVE | Audio-only Accuracy | 82.3 | 3 |
| Audio Source Separation | AVE | Human Preference Score | 68 | 1 |