| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text-to-Video Retrieval | MSVD | R@1 77.08 | 290 |
| Text-to-Video Retrieval | MSVD (test) | R@1 2,030 | 211 |
| Video Captioning | MSVD | CIDEr 195.6 | 157 |
| Video Question Answering | MSVD | Accuracy 79.5 | 152 |
| Video-to-Text Retrieval | MSVD | R@1 88.4 | 119 |
| Video Captioning | MSVD (test) | CIDEr 189.4 | 111 |
| Video-to-Text Retrieval | MSVD (test) | R@1 83.1 | 68 |
| Open-ended Video Question Answering | MSVD-QA | Accuracy 79.9 | 59 |
| Video Understanding | MSVD | Accuracy 71.6 | 39 |
| Video Question Answering | MSVD (test) | Accuracy 76.4 | 30 |
| Video-Text Retrieval | MSVD | R@1 62.7 | 29 |
| Text-to-Video Retrieval | MSVD zero-shot | Recall@1 83.3 | 26 |
| Open Ended Question Answering | MSVD | Accuracy 73.92 | 22 |
| Text-to-Video Retrieval | MSVD (val) | Recall@1 51.8 | 15 |
| Video-to-Text Retrieval | MSVD MLVP (test) | Score (Gaussian Noise) 36.57 | 14 |
| Text-to-Video Retrieval | MSVD 1kA (test) | OCR 35.07 | 14 |
| Video Captioning | MSVD-CTN (test) | ROUGE-L 31.46 | 10 |
| Emotional video captioning | EVC-MSVD | Accuracy (SW) 91.3 | 9 |
| Text Retrieval | MSVD | R@1 61.5 | 8 |
| Text-to-Video Retrieval | MSVD 43 (val) | Recall@1 50 | 7 |
| Vehicle Action Understanding | MSVD | BLEU-1 0.88 | 6 |
| Video Captioning | MSVD | METEOR 51.2 | 6 |
| Text-to-video retrieval | MSVD 10s (test) | R@1 39.3 | 6 |
| Video Classification | MSVD | Accuracy 48.4 | 5 |
| Video Understanding | MSVD | Accuracy 69.8 | 4 |