| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 83.7 | 117 |
| Video Captioning | YouCook II (val) | CIDEr | 221 | 98 |
| Dense Video Captioning | YouCook2 | SODA_c | 7.3 | 29 |
| Video Level Summarization | YouCook2 | METEOR | 30.08 | 21 |
| Action Grounding | YouCook-Interactions (val) | Accuracy | 60.05 | 13 |
| Video Retrieval | Youcook2 | R@1 | 10.9 | 6 |
| Audio-to-video retrieval | YouCook (test) | Recall@1 | 33.1 | 4 |
| Object Classification | YouCook II (val) | Object Top-1 Acc | 13.2 | 4 |
| Verb Classification | YouCook II (val) | Top-1 Accuracy | 0.161 | 4 |
| Video Retrieval | YouCook II (test) | Avg Recall@{1,5,10} | 50.6 | 3 |
| Video-Audio Captioning | YouCook2 | CIDEr | 197.8 | 2 |
| Object localization | YouCook2-BB | Full Localization | 0.5925 | 2 |
| Video/Paragraph Retrieval (Video-to-Text) | YouCook2 | Recall@1 | 51.3 | 2 |