| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy 72.4 | 481 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy 88.2 | 371 |
| Text-to-Video Retrieval | MSRVTT (test) | Recall@1 18.2 | 155 |
| Video Captioning | MSRVTT | CIDEr 80.3 | 101 |
| Text-to-Video Retrieval | MSRVTT | R@1 63.9 | 98 |
| Text-to-Video Retrieval | MSRVTT | R@1 61 | 75 |
| Text-to-Video Retrieval | MSRVTT 1k (test) | Recall@10 87.4 | 63 |
| Video Captioning | MSRVTT (test) | CIDEr 80.5 | 61 |
| Video Captioning | MSRVTT | CIDEr 80.3 | 61 |
| Video Question Answering | MSRVTT-MC | Accuracy 97.7 | 61 |
| Text-to-Video Retrieval | MSRVTT | Recall@1 49.9 | 48 |
| Video Question Answering | MSRVTT | Accuracy 66.7 | 46 |
| Text-to-Video Retrieval | MSRVTT (1K-A) | R@1 49.3 | 42 |
| Video Generation | MSRVTT (val) | FVD 414 | 40 |
| Text-to-Video Retrieval | MSRVTT (UTD) | Recall@1 31.1 | 34 |
| Text-to-Video Retrieval | MSRVTT full (test val) | Recall@1 43.6 | 34 |
| Video Question Answering | MSRVTT-MC (test) | Accuracy 97.8 | 31 |
| Text-to-Video Retrieval | MSRVTT (MSR) zero-shot | R@1 42.6 | 26 |
| Video Question Answering | MSRVTT (test) | Accuracy 92.7 | 26 |
| Video-to-Text Retrieval | MSRVTT | R@1 50.1 | 24 |
| Text-to-Video Retrieval | MSRVTT 1K-A (test) | R@1 54.2 | 23 |
| Text-to-Video Retrieval | MSRVTT 1K 1.0 (test) | R@1 40.9 | 23 |
| Video Captioning | MSRVTT (full) | CIDEr 75.9 | 20 |
| Image-to-Video Retrieval | MSRVTT I2V | Recall@1 92.4 | 18 |
| Video-Text Retrieval | MSRVTT | GFLOPS 44.7 | 18 |