| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Rule-level anomaly detection | CLEVRER | AUROC 0.844 | 15 |
| Temporal Jigsaw Puzzle Solving | CLEVRER | Normalized Kendall Distance 0 | 13 |
| Temporal and causal video reasoning | CLEVRER-Humans (test) | Accuracy (Per Option) 74.1 | 12 |
| Counterfactual Prediction | CLEVRER Hypothesis | CF-Acc 81 | 9 |
| Visual Question Answering | CLEVRER 1.0 (test) | Descriptive Accuracy 0.94 | 8 |
| Video Question Answering | CLEVRER (test) | Descriptive Accuracy 96.46 | 7 |
| Segmentation | CLEVRER (Blender engine) zero-shot | Segmentation Map IoU (First Frame) 67 | 6 |
| Optical Flow | CLEVRER Full Sequence Blender (test) | Optical Flow EPE 5.43 | 6 |
| Optical Flow | CLEVRER First Frame Blender (test) | Optical Flow EPE 2.79 | 6 |
| Object Segmentation | CLEVRER Full Sequence Blender (test) | Segmentation Map IoU 30 | 6 |
| Object Segmentation | CLEVRER First Frame Blender (test) | Segmentation Map IoU 67 | 6 |
| Video Generation | CLEVRER 256x256 (test) | FVD 87.4 | 6 |
| Predictive Video Reasoning | CLEVRER (val) | Accuracy 87.5 | 5 |
| Counterfactual Video Reasoning | CLEVRER (val) | Accuracy 86.69 | 5 |
| Explanatory Video Reasoning | CLEVRER (val) | Accuracy 99.94 | 5 |
| Physical Reasoning | CLEVRER-LLMPhy | mIoU 97.2 | 5 |
| Collision Counting | CLEVRER T3 (val) | Accuracy 77.84 | 4 |
| Collision Event Detection | CLEVRER T2 (val) | Accuracy 74.95 | 4 |
| Collision Classification | CLEVRER T1 (val) | Accuracy 93.84 | 4 |
| Controllable Video Generation | CLEVRER (test) | SSIM 92.52 | 4 |
| Video Reasoning | CLEVRER | Accuracy 78.5 | 4 |
| Descriptive Video Reasoning | CLEVRER (val) | Accuracy 97.99 | 3 |
| Visual Question Answering | CLEVRER (test val) | Accuracy (per option) 98.5 | 2 |