| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy 72.4 | 491 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy 88.2 | 376 |
| Text-to-Video Retrieval | MSRVTT (test) | Recall@1 18.2 | 155 |
| Text-to-Video Retrieval | MSRVTT | R@1 63.9 | 116 |
| Video Captioning | MSRVTT | CIDEr 80.3 | 107 |
| Video Question Answering | MSRVTT | Accuracy 66.7 | 100 |
| Text-to-Video Retrieval | MSRVTT | R@1 61 | 75 |
| Video Captioning | MSRVTT | CIDEr 80.3 | 68 |
| Text-to-Video Retrieval | MSRVTT 1k (test) | Recall@10 87.4 | 63 |
| Video Captioning | MSRVTT (test) | CIDEr 80.5 | 61 |
| Video Question Answering | MSRVTT-MC | Accuracy 97.7 | 61 |
| Text-to-Video Retrieval | MSRVTT | Recall@1 49.9 | 59 |
| Text-to-Video Retrieval | MSRVTT (1K-A) | R@1 49.3 | 42 |
| Video Generation | MSRVTT (val) | FVD 414 | 40 |
| Text-to-Video Retrieval | MSRVTT | Recall@1 51 | 38 |
| Video-to-Text Retrieval | MSRVTT | R@1 49.2 | 35 |
| Text-to-Video Retrieval | MSRVTT (UTD) | Recall@1 31.1 | 34 |
| Text-to-Video Retrieval | MSRVTT full (test val) | Recall@1 43.6 | 34 |
| Video Question Answering | MSRVTT-MC (test) | Accuracy 97.8 | 31 |
| Text-to-Video Retrieval | MSRVTT (MSR) zero-shot | R@1 43.3 | 30 |
| Video Question Answering | MSRVTT (test) | Accuracy 92.7 | 26 |
| Video-to-Text Retrieval | MSRVTT | R@1 50.1 | 24 |
| Text-to-Video Retrieval | MSRVTT 1K-A (test) | R@1 54.2 | 23 |
| Text-to-Video Retrieval | MSRVTT 1K 1.0 (test) | R@1 40.9 | 23 |
| Video Understanding | MSRVTT | Acc 57.7 | 21 |