| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Audio-visual understanding | WorldSense | Accuracy: 66.4 | 42 |
| Video Understanding | WorldSense | Score: 52.01 | 25 |
| Multimodal Fact-Level Attribution | WorldSense 1.0 (sampled examples) | Accuracy: 71.4 | 24 |
| Long Audio-Video Question Answering | WorldSense | Average Accuracy: 61.2 | 18 |
| Audio-visual Question Answering | WorldSense | Accuracy: 50 | 18 |
| Audio-Visual Perception | WorldSense | Score: 47.4 | 8 |
| Video Reasoning | WorldSense | Accuracy: 40.4 | 7 |
| Commonsense Reasoning | WorldSense | Overall Score: 42.6 | 7 |
| Audio-Visual Question | WorldSense | Accuracy (Clean): 59.7 | 6 |
| Video Grounded Reasoning | WorldSense | Original Score: 45.4 | 6 |
| Common Sense Reasoning | WorldSense | Accuracy: 0.637 | 6 |
| Video Question Answering | WorldSense | Accuracy: 49.2 | 5 |
| Visual Question Answering | WorldSense sampled examples 1.0 | Accuracy: 60 | 4 |