| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Text-to-Video Retrieval | MSR-VTT | Recall@1 64.3 | 313 | |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 990 | 234 | |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 90.6 | 211 | |
| Video-to-Text Retrieval | MSR-VTT | Recall@1 64.8 | 157 | |
| Video Captioning | MSR-VTT (test) | CIDEr 104.2 | 121 | |
| Text-to-Video Generation | MSR-VTT (test) | CLIP Similarity 0.3123 | 85 | |
| Video-to-Text Retrieval | MSR-VTT (1k-A) | Recall@5 84.1 | 74 | |
| Text-to-Video Retrieval | MSR-VTT 1k-A (test) | R@1 48.5 | 57 | |
| Text-to-Video Retrieval | MSR-VTT (9K) | R@1 52 | 55 | |
| Text-to-Video Retrieval | MSR-VTT (Full) | R@1 34.3 | 55 | |
| Text-to-Video Retrieval | MSR-VTT 1K (test) | R@1 55.9 | 45 | |
| Video-to-Text Retrieval | MSR-VTT 9K | R@1 47.7 | 43 | |
| Video Question Answering | MSR-VTT | Accuracy 94.4 | 42 | |
| Video-to-Text Retrieval | MSR-VTT 1K (test) | R@1 53.7 | 39 | |
| Text-to-Video Retrieval | MSR-VTT 1K (val) | R@1 53.3 | 38 | |
| Video-to-Text Retrieval | MSR-VTT (Full) | Recall@1 64.7 | 38 | |
| Text-to-Video Generation | MSR-VTT | CLIPSIM 0.3204 | 28 | |
| Text-to-Video Retrieval | MSR-VTT 7K | Recall@10 82.8 | 27 | |
| Text-to-Video Retrieval | MSR-VTT 1K videos (test) | Recall@10 75.1 | 25 | |
| Text-to-Video Retrieval | MSR-VTT Official full-size (test) | R@1 48.8 | 24 | |
| Cross-modal retrieval (Audio) | MSR-VTT | R@1 42 | 22 | |
| Text-to-Video Generation | MSR-VTT zero-shot | CLIPSIM 32.04 | 20 | |
| Video Retrieval | MSR-VTT | R@1 57.7 | 19 | |
| Audio-to-Visual Retrieval | MSR-VTT (test) | R@1 150 | 18 | |
| Text-to-Video Retrieval | MSR-VTT 1k-Yu (test) | R@1 32.4 | 18 |