| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Semantic Role Prediction | VidSitu (test) | CIDEr: 84.85 | 17 |
| Event relation prediction | VidSitu | Mean Accuracy: 35.32 | 12 |
| Video Situation Recognition | VidSitu | CIDEr: 76.24 | 9 |
| Semantic Role Labeling | VidSitu (val) | CIDEr: 90.12 | 9 |
| Verb Prediction | VidSitu (val) | Top-1 Verb Accuracy: 56.15 | 8 |
| Verb prediction | VidSitu (test) | Accuracy@1: 44.67 | 8 |
| Semantic Role Labeling | VidSitu (test) | CIDEr: 83.68 | 5 |
| Localization | VidSitu (val) | IoU @ 0.3: 70.33 | 5 |
| Video Semantic Role Labeling | VidSitu | CIDEr: 73.71 | 5 |
| Semantic Role Labeling Captioning | VidSitu | CIDEr: 76.34 | 5 |
| Localization | VidSitu (test) | IoU@0.3: 59.64 | 3 |
| Multimodal Event Extraction | VidSitu Aud | ET: 24.2 | 3 |
| Video Tracking | VidSitu | V-Trck: 23.2 | 3 |
| Event Relation | VidSitu | ER: 14.5 | 3 |
| Event Typing | VidSitu | ET: 22.3 | 3 |
| Video Tracking | VidSitu Txt | V-Trck: 34.4 | 3 |
| Event Relation | VidSitu Txt | Event Relation (ER): 23.1 | 3 |
| Event Typing | VidSitu-Txt | ET Score: 32.8 | 3 |
| Grounded Video Situation Recognition | VidSitu v1 (val) | Verb Accuracy@1: 46.79 | 3 |
| Grounded Video Situation Recognition | VidSitu (test) | Verb Acc@1: 46.79 | 3 |