| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Multimodal Perception | VStar | Accuracy 92.67 | 18 |
| Visual Perception | VStar (test) | Accuracy 92.7 | 15 |
| Video-grounded Dialogue Generation | VSTAR (test) | BLEU-1 0.092 | 9 |
| Dialogue Topic Segmentation | VSTAR | WinDif 0.765 | 7 |
| Dialogue Scene Segmentation | VSTAR (test) | mIoU 53.6 | 7 |