| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Audio-Visual Question Answering | MUSIC-AVQA 1.0 (test) | AV Localis Accuracy | 85.09 | 96 |
| Audio-Visual Question Answering | MUSIC-AVQA (test) | Acc (Avg) | 80.7 | 59 |
| Audio Question Answering | MUSIC-AVQA 1.0 (test) | Counting Accuracy | 84.86 | 43 |
| Audio-Visual Question Answering | Music-AVQA | Accuracy | 81.3 | 21 |
| Overall Audio-Visual Question Answering | MUSIC-AVQA (test) | Overall Accuracy | 71.52 | 21 |
| Audio-Video Question Answering | MUSIC-AVQA | AV Temporal Acc | 51.77 | 19 |
| Audio-Visual Question Answering | MUSIC-AVQA Bias v2.0 (test) | Total Accuracy | 77.33 | 18 |
| Audio-Visual Question Answering | MUSIC-AVQA balanced v2.0 (test) | Total Accuracy | 75.44 | 18 |
| Audio Question Answering | MUSIC-AVQA (test) | Accuracy (Avg) | 80.51 | 17 |
| Visual Question Answering | MUSIC-AVQA v1.0 (test) | Accuracy (Count) | 0.8396 | 16 |
| Audio-Visual Question Answering | MUSIC-AVQA-R (test) | Audio QA Count (Head) | 82.67 | 13 |
| Visual Question Answering | MUSIC-AVQA (test) | Accuracy (Counting) | 71.56 | 12 |
| Audio-Visual Question Answering | MUSIC-AVQA balanced (test) | Existential Score | 83.62 | 8 |
| Audio-Visual Question Answering | Music-AVQA 2000 samples | ASR Rate | 13.8 | 7 |
| Audio Visual Question Answering | Music-AVQA | Music-AVQA Clean Accuracy | 80.7 | 7 |
| Audio-Visual Question Answering | Music-AVQA 30 (test) | Overall Accuracy | 84.3 | 7 |
| Audio-Visual Question Answering | MUSIC-AVQA 2.0 (test) | Accuracy (Audio, Count) | 83.82 | 4 |
| Audio-Visual Question Answering | MUSIC-AVQA Contrasting Binary QA pairs v2.0 | Total Accuracy | 58.86 | 4 |
| Video Question Answering | MUSIC-AVQA | Accuracy | 80.7 | 2 |