| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Audio-visual understanding | Daily-Omni | Accuracy 82.8 | 58 |
| Audio-Visual Dialogue | Daily-Omni | Score 71.9 | 32 |
| Video Understanding | Daily-Omni | Daily Score 57.7 | 20 |
| Audiovisual Understanding & Reasoning | Daily-Omni | Score 77.9 | 15 |
| Omnimodal common event understanding | Daily-Omni | Accuracy 81.4 | 13 |
| QA performance by Gemini-2.5-Pro based on captions | Daily-Omni (test) | Daily-Omni QA Score 61.2 | 13 |
| Video Question Answering | Daily-Omni | Score 60.2 | 11 |
| Audio-Visual Question Answering | Daily-Omni 1 FPS | Metric 3070.9 | 8 |
| Audio-Visual Question Answering | Daily-Omni | Score 73.6 | 8 |
| Audio-Visual Perception | Daily-Omni | Score 60.65 | 8 |
| Omni-modal collaborative reasoning | Daily-Omni | Top-1 Accuracy 71.09 | 6 |