| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1 64.4 | 369 | |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1990 | 255 | |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 90.6 | 211 | |
| Video-to-Text Retrieval | MSR-VTT | Recall@1 64.8 | 185 | |
| Video Captioning | MSR-VTT (test) | CIDEr 104.2 | 128 | |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity 0.3123 | 85 | |
| Video-to-Text Retrieval | MSR-VTT (1k-A) | Recall@5 84.1 | 74 | |
| Text-to-Video Retrieval | MSR-VTT 1K (test) | R@1 93.61 | 65 | |
| Text-to-Video Retrieval | MSR-VTT 1k-A (test) | R@1 48.5 | 57 | |
| Text-to-Video Retrieval | MSR-VTT (9K) | R@1 52 | 55 | |
| Text-to-Video Retrieval | MSR-VTT (Full) | R@1 34.3 | 55 | |
| Video-to-Text Retrieval | MSR-VTT 9K | R@1 47.7 | 43 | |
| Video Question Answering | MSR-VTT | Accuracy 94.4 | 42 | |
| Video-to-Text Retrieval | MSR-VTT 1K (test) | R@1 53.7 | 39 | |
| Text-to-Video Retrieval | MSR-VTT 1K (val) | R@1 53.3 | 38 | |
| Video-to-Text Retrieval | MSR-VTT (Full) | Recall@1 64.7 | 38 | |
| Video Retrieval | MSR-VTT | R@1 57.7 | 31 | |
| Text-to-Video Generation | MSR-VTT | CLIPSIM 0.3204 | 28 | |
| Text-to-Video Retrieval | MSR-VTT 7K | Recall@10 82.8 | 27 | |
| Text-to-Video Generation | MSR-VTT zero-shot | FVD 212 | 26 | |
| Text-to-Video Retrieval | MSR-VTT 1K videos (test) | Recall@10 75.1 | 25 | |
| Text-to-Video Retrieval | MSR-VTT Official full-size (test) | R@1 48.8 | 24 | |
| Cross-modal retrieval (Audio) | MSR-VTT | R@1 42 | 22 | |
| Video-Text Retrieval | MSR-VTT | Recall (Text-to-Video) 42.8 | 22 | |
| Cross-modal Retrieval | MSR-VTT (test) | R@1 (V→T) 37.3 | 19 | |