| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy 72.4 | 505 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy 88.2 | 376 |
| Text-to-Video Retrieval | MSRVTT (test) | Recall@5 46 | 178 |
| Text-to-Video Retrieval | MSRVTT | R@1 63.9 | 144 |
| Video Captioning | MSRVTT | CIDEr 80.3 | 107 |
| Video Question Answering | MSRVTT | Accuracy 66.7 | 100 |
| Text-to-Video Retrieval | MSRVTT | R@1 61 | 75 |
| Video Captioning | MSRVTT | CIDEr 80.3 | 68 |
| Text-to-Video Retrieval | MSRVTT 1k (test) | Recall@10 87.4 | 63 |
| Video Captioning | MSRVTT (test) | CIDEr 80.5 | 61 |
| Video Question Answering | MSRVTT-MC | Accuracy 97.7 | 61 |
| Text-to-Video Retrieval | MSRVTT | Recall@1 49.9 | 59 |
| Video Understanding | MSRVTT | Acc 57.7 | 43 |
| Video-to-Text Retrieval | MSRVTT 1kA severity degree 2 | Performance (Gaussian Noise) 44.2 | 42 |
| Text-to-Video Retrieval | MSRVTT (1K-A) | R@1 49.3 | 42 |
| Video Generation | MSRVTT (val) | FVD 414 | 40 |
| Text-to-Video Retrieval | MSRVTT | Recall@1 51 | 38 |
| Video-to-Text Retrieval | MSRVTT | R@1 49.2 | 35 |
| Text-to-Video Retrieval | MSRVTT (UTD) | Recall@1 31.1 | 34 |
| Text-to-Video Retrieval | MSRVTT full (test val) | Recall@1 43.6 | 34 |
| Video Question Answering | MSRVTT-MC (test) | Accuracy 97.8 | 31 |
| Text-to-Video Retrieval | MSRVTT (MSR) zero-shot | R@1 43.3 | 30 |
| Video-to-Text Retrieval | MSRVTT | Recall@1 40.4 | 28 |
| Video-to-Text Retrieval | MSRVTT v2t 1.0 | Performance (Gaussian Noise) 24.7 | 28 |
| Video Question Answering | MSRVTT (test) | Accuracy 92.7 | 26 |