| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSVD | R@1 71.9 | 218 | |
| Text-to-Video Retrieval | MSVD (test) | R@1 2,030 | 204 | |
| Video Captioning | MSVD | CIDEr 195.6 | 128 | |
| Video Captioning | MSVD (test) | CIDEr 189.4 | 111 | |
| Video Question Answering | MSVD | Accuracy 79.5 | 100 | |
| Video-to-Text Retrieval | MSVD | R@1 88.4 | 93 | |
| Video-to-Text Retrieval | MSVD (test) | R@1 83.1 | 61 | |
| Open-ended Video Question Answering | MSVD-QA | Accuracy 79.9 | 59 | |
| Video Question Answering | MSVD (test) | Accuracy 76.4 | 30 | |
| Open Ended Question Answering | MSVD | Accuracy 73.92 | 22 | |
| Video-Text Retrieval | MSVD | GFLOPS 267.8 | 18 | |
| Text-to-Video Retrieval | MSVD (val) | Recall@1 51.8 | 15 | |
| Video Captioning | MSVD-CTN (test) | ROUGE-L 31.46 | 10 | |
| Text-to-Video Retrieval | MSVD zero-shot | Recall@1 49.9 | 8 | |
| Text Retrieval | MSVD | R@1 61.5 | 8 | |
| Text-to-Video Retrieval | MSVD 43 (val) | Recall@1 50 | 7 | |
| Video Understanding | MSVD | Accuracy 70.4 | 6 | |
| Video Captioning | MSVD | METEOR 51.2 | 6 | |
| Text-to-video retrieval | MSVD 10s (test) | R@1 39.3 | 6 | |
| Video Captioning | MSVD Cap | CIDEr 118.2 | 4 | |
| Video Question Answering | MSVD Open Ended (OE) | Accuracy 48.9 | 4 | |
| Video Captioning | MSVD | SB 84.32 | 4 | |
| Video-to-Text Retrieval | MSVD 43 (val) | R@1 68.7 | 4 | |
| Text-to-Video Retrieval | MSVD (standard) | Recall@1 58.4 | 3 | |