| Task Name | Dataset Name | Metric | SOTA Result | Trend |
|---|---|---|---|---|
| Text-to-Video Generation | UCF-101 | FVD | 81 | 61 |
| Text-to-Video Generation | UCF-101 zero-shot | FVD | 200.2 | 44 |
| Video Classification | UCF-101 (test) | Top-1 Acc | 98.1 | 41 |
| Action Recognition | UCF-101 Few-shot | Top-1 Accuracy | 94.6 | 30 |
| Text-to-Video Generation | UCF-101 (test) | FVD | 217.24 | 25 |
| Adversarial Video Purification | UCF-101 (test) | Clean Accuracy | 96 | 24 |
| Action Recognition | UCF-101 1.0 (test) | Top-1 Acc | 95.9 | 23 |
| Class-conditional video generation | UCF-101 v1.0 (train test) | FVD | 66.32 | 21 |
| Video Classification | UCF-101 (split-1) | Accuracy | 89.08 | 21 |
| Image Classification | UCF-101 (test) | Accuracy | 90.67 | 18 |
| Action Detection | UCF-101-24 (test) | F1 Score (IoU=0.5) | 90.7 | 15 |
| Few-shot video recognition | UCF-101 | Top-1 Acc (K=2) | 88.1 | 13 |
| Text-to-Video Generation | UCF-101 (fine-tuning) | IS | 95.23 | 13 |
| Long Video Generation | UCF-101 128-frame (test) | FVD | 968 | 13 |
| Action Recognition | UCF-101 (81/20) | Accuracy | 64.8 | 13 |
| Video Generation | UCF-101 16-frame | IS | 90.52 | 12 |
| Unconditional video generation | UCF-101 256x256 | FVD (256x256, 2048) | 210.6 | 12 |
| Video Generation | UCF-101 64 x 64 (test) | FVD | 158.7 | 12 |
| Audio-visual Recognition | UCF-101 (full) | Top-1 Accuracy | 97.2 | 11 |
| Action Recognition | UCF-101 | Accuracy (ACC) | 90.4 | 10 |
| Activity Recognition | UCF-101 first split among three (test) | Top-1 Accuracy | 72.4 | 10 |
| Action Detection | UCF-101-24 (split 1) | Frame mAP (IoU=0.5) | 84.8 | 10 |
| Video Action Recognition | UCF-101 (val) | Top-1 Acc (K=2) | 85.2 | 8 |
| Image Classification | UCF-101 all-to-all | Accuracy | 71.6 | 7 |
| Unconditional Video Generation | UCF-101 | FVD (2048 Dim) | 279 | 7 |