| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Text-to-Video Retrieval | DiDeMo (test) | R@1 70.5 | 376 |
| Text-to-Video Retrieval | DiDeMo | R@1 32.4 | 360 |
| Video-to-Text Retrieval | DiDeMo | R@1 71.9 | 108 |
| Video-to-Text Retrieval | DiDeMo (test) | R@1 67.5 | 92 |
| Text-to-Video Retrieval | DiDeMo (DDM) full (test val) | Recall@1 46.3 | 34 |
| Text-to-Video Retrieval | DiDeMo (DDM) zero-shot | R@1 48.6 | 22 |
| Retrieval | DiDeMo T+A -> V | Recall@1 82.1 | 20 |
| Video Retrieval | DiDeMo | R@1 46.1 | 18 |
| Video-Text Retrieval | DIDEMO | GFLOPS 44.5 | 18 |
| Text-to-video retrieval | DiDeMo (UTD-split) | Recall@1 35.6 | 17 |
| Text-to-Video Retrieval | DiDeMo 1K videos (test) | R@1 37 | 16 |
| Zero-shot Retrieval (T+V → A) | DiDeMo | Recall@1 0.695 | 14 |
| Zero-shot Retrieval (T → A+V) | DiDeMo | Recall@1 53.7 | 14 |
| Text-to-video retrieval | DiDeMo 28s (test) | R@1 38.1 | 11 |
| Video Corpus Moment Retrieval (VCMR) | DiDeMo 14 (test) | Recall@1 (IoU=0.5) 2.26 | 11 |
| Text-to-Video Retrieval | DiDeMo 12 (full-corpus) | R@1 26 | 8 |
| Text-to-Video Retrieval | DiDeMo 12 (test) | R@1 45.3 | 8 |
| Text-to-Video Retrieval | DiDeMo (val) | R@1 53.9 | 8 |
| Video Retrieval (clip-caption) | DiDeMo (test) | R@1 20.4 | 7 |
| Video Retrieval | DiDeMo (test) | R@1 60 | 7 |
| Text-to-Video Retrieval | DiDeMo 1 (val) | R@1 49 | 6 |
| Text-to-Video Retrieval | DiDeMo CLIP-based (test) | R@1 48.4 | 5 |
| Video-to-text retrieval | DiDeMo (full) | R@1 46 | 5 |
| Video Grounding | DiDeMo (test) | R@1 (IoU=1.0) 25.57 | 4 |
| Video-to-Text Retrieval | DiDeMo CLIP-based (test) | R@1 47.7 | 4 |