| Task Name | Dataset Name | SOTA Result | Trend | |
|---|---|---|---|---|
| Composed Video Retrieval | WebVid-CoVR (test) | R@15,982 | 49 | |
| Video Reconstruction | WebVid 10M | PSNR 35.76 | 34 | |
| Video Reconstruction | Webvid (val) | PSNR 34.75 | 16 | |
| Video Captioning Evaluation | WebVid 10M | CLIP Score 61.377 | 12 | |
| Video Annotation | WebVid-10M | Avg Length 214.49 | 12 | |
| Video reconstruction | WebVid-10M (val) | VCPR492.5 | 10 | |
| Video Generation | WebVid mini (val) | FVD @ 1 Frame 526 | 10 | |
| Video Generation | WebVid (test) | LPIPS 0.135 | 7 | |
| Watermark Extraction | WebVid 1000 videos 10M | Average Frame Score (N=3) 99.98 | 6 | |
| Video Watermarking Visual Quality | WebVid 10M | FVD 361.3 | 6 | |
| Camera-controlled video generation | WebVid | RotErr 3.162 | 5 | |
| Video Generation | webvid (test) | SubC 92.8 | 5 | |
| Semantic Consistency | WebVid10M | CLIP-F 0.93 | 5 | |
| Text-to-Video Generation | WebVid 400 samples (val) | CLIP Score 0.308 | 4 | |
| Cross-View Video Generation | WebVid 200 monocular videos (test) | Subjective Consistency 92.18 | 4 | |
| Image-to-Video Generation | WebVid (test) | FID 29.94 | 4 | |
| Text-to-Video Generation | WebVid (test) | FID 61.52 | 4 | |
| Video Editing | WebVid | PSNR 33.07 | 4 | |
| Image-to-Video generation | WebVid 10M | Temporal Coherence 96.9 | 4 | |
| Text-to-Video Generation | WebVid-10M 2-million | CLIP Score 48.3 | 4 | |
| Text-to-Video Generation | WebVid-10M (val) | FVD 292.35 | 4 | |
| Image-to-Video generation with fine-grained motion control | WebVid-10M 1k (val) | FVD 59.88 | 3 | |
| Multi-subject and motion customization | WebVid subject pairs (test) | CLIP Text Alignment Score 0.662 | 3 | |
| Geometry Consistency | WebVid10M | Rot. AUC @ 5° 25.2 | 3 | |
| Transition Video Generation | Webvid10M (test) | LPIPS (First Frame) 0.3794 | 3 | |