| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Composed Video Retrieval | WebVid-CoVR (test) | R@15,982 | 45 |
| Video Reconstruction | WebVid 10M | PSNR 35.76 | 34 |
| Video Reconstruction | Webvid (val) | PSNR 34.75 | 16 |
| Video Captioning Evaluation | WebVid 10M | CLIP Score 61.377 | 12 |
| Video Annotation | WebVid-10M | Avg Length 214.49 | 12 |
| Video reconstruction | WebVid-10M (val) | VCPR 492.5 | 10 |
| Video Generation | WebVid mini (val) | FVD @ 1 Frame 526 | 10 |
| Video Generation | WebVid (test) | LPIPS 0.135 | 7 |
| Camera-controlled video generation | WebVid | RotErr 3.162 | 5 |
| Video Generation | webvid (test) | SubC 92.8 | 5 |
| Semantic Consistency | WebVid10M | CLIP-F 0.93 | 5 |
| Image-to-Video Generation | WebVid (test) | FID 29.94 | 4 |
| Text-to-Video Generation | WebVid (test) | FID 61.52 | 4 |
| Video Editing | WebVid | PSNR 33.07 | 4 |
| Image-to-Video generation | WebVid 10M | Temporal Coherence 96.9 | 4 |
| Text-to-Video Generation | WebVid-10M 2-million | CLIP Score 48.3 | 4 |
| Text-to-Video Generation | WebVid-10M (val) | FVD 292.35 | 4 |
| Image-to-Video generation with fine-grained motion control | WebVid-10M 1k (val) | FVD 59.88 | 3 |
| Multi-subject and motion customization | WebVid subject pairs (test) | CLIP Text Alignment Score 0.662 | 3 |
| Geometry Consistency | WebVid10M | Rot. AUC @ 5° 25.2 | 3 |
| Transition Video Generation | Webvid10M (test) | LPIPS (First Frame) 0.3794 | 3 |
| Image-to-Video Generation | WebVid-10M (val) | F-Consistency (4) 95.36 | 3 |
| Text-to-Video Generation | WebVid10M | FID 7.64 | 3 |
| Text-guided Video Editing | WebVid-10M (val) | Frame Consistency (CLIP Score) 94.9 | 2 |