| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Semantic Role Prediction | VidSitu (test) | CIDEr 84.85 | 17 |
| Event relation prediction | VidSitu | Mean Accuracy 35.32 | 12 |
| Verb prediction | VidSitu (test) | Accuracy@1 44.67 | 7 |
| Multimodal Event Extraction | VidSitu Aud | ET 24.2 | 3 |
| Video Tracking | VidSitu | V-Trck 23.2 | 3 |
| Event Relation | VidSitu | ER 14.5 | 3 |
| Event Typing | VidSitu | ET 22.3 | 3 |
| Video Tracking | VidSitu Txt | V-Trck 34.4 | 3 |
| Event Relation | VidSitu Txt | Event Relation (ER) 23.1 | 3 |
| Event Typing | VidSitu-Txt | ET Score 32.8 | 3 |
| Grounded Video Situation Recognition | VidSitu v1 (val) | Verb Accuracy@1 46.79 | 3 |
| Grounded Video Situation Recognition | VidSitu (test) | Verb Acc@1 46.79 | 3 |