| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Video Understanding | OmniVideoBench | Score | 40.3 | 32 |
| Audio-Video Understanding | OmniVideoBench | Avg Latency | 29.2 | 23 |
| Omnimodal Question Answering | OmniVideoBench 1.0 (test) | Compare Attr | 44.44 | 18 |
| Audio-visual Question Answering | OmniVideoBench | Accuracy | 0.356 | 18 |
| Fine-grained audio-visual video understanding | OmniVideoBench | Accuracy | 58.9 | 12 |
| Audio-Visual Joint Reasoning | OmniVideoBench | Music Score | 56.2 | 11 |
| Video Reasoning | OmniVideoBench | Accuracy (Long) | 40.52 | 8 |
| Omni-modal collaborative reasoning | OmniVideoBench | Top-1 Accuracy | 40.5 | 6 |