| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Audio-visual understanding | WorldSense | Accuracy | 66.4 | 72 |
| Audio-Visual Reasoning | WorldSense | Score | 54.3 | 32 |
| Video Understanding | WorldSense | Score | 52.01 | 25 |
| Omnimodal Understanding | WorldSense v1.0 (test) | Tech & Science Score | 52.65 | 24 |
| Multimodal Fact-Level Attribution | WorldSense 1.0 (sampled examples) | Accuracy | 71.4 | 24 |
| Common Sense Reasoning | WorldSense | Accuracy | 64.6 | 19 |
| Long Audio-Video Question Answering | WorldSense | Average Accuracy | 61.2 | 18 |
| Audio-visual Question Answering | WorldSense | Accuracy | 50 | 18 |
| Multi-modal Understanding | WorldSense | WorldSense Performance | 46.85 | 14 |
| Long Video Reasoning | WorldSense | Overall Accuracy | 52.5 | 13 |
| Omni-modal Understanding | WorldSense | Accuracy | 48 | 12 |
| Video Question Answering | WorldSense | Accuracy (Tech & Science) | 48.78 | 10 |
| Video Understanding | WorldSense | TFLOPs | 12 | 8 |
| Video Understanding | WorldSense (test) | Overall Accuracy | 42.6 | 8 |
| Audio-Visual Perception | WorldSense | Score | 47.4 | 8 |
| Video Reasoning | WorldSense | Accuracy | 40.4 | 7 |
| Commonsense Reasoning | WorldSense | Overall Score | 42.6 | 7 |
| Audio-Visual Question | WorldSense | Accuracy (Clean) | 59.7 | 6 |
| Video Grounded Reasoning | WorldSense | Original Score | 45.4 | 6 |
| Video Question Answering | WorldSense | Accuracy | 49.2 | 5 |
| Visual Question Answering | WorldSense sampled examples 1.0 | Accuracy | 60 | 4 |
| Text Query QA | WorldSense | Score | 65.5 | 3 |