| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Audio-visual understanding | WorldSense | Accuracy 66.4 | 32 |
| Multimodal Fact-Level Attribution | WorldSense 1.0 (sampled examples) | Accuracy 71.4 | 24 |
| Long Audio-Video Question Answering | WorldSense | Average Accuracy 61.2 | 18 |
| Audio-visual Question Answering | WorldSense | Accuracy 50 | 18 |
| Audio-Visual Perception | WorldSense | Score 47.4 | 8 |
| Video Understanding | WorldSense | Score 52.01 | 8 |
| Common Sense Reasoning | WorldSense | Accuracy 0.637 | 6 |
| Visual Question Answering | WorldSense sampled examples 1.0 | Accuracy 60 | 4 |