| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Video Action Classification | COIN | Top-1 Acc 95.3 | 33 |
| Action Phase Classification | COIN | Phase Acc 54.1 | 32 |
| Action segmentation | COIN | Frame Accuracy 70.02 | 29 |
| Classification of Procedural Activities | COIN (test) | Accuracy 90.81 | 23 |
| Action Segmentation | COIN (test) | Frame Accuracy 72.8 | 23 |
| Continual Multimodal Instruction Tuning | CoIN ScienceQA TextVQA ImageNet GQA VizWiz Grounding Chameleon backbone | Accuracy 68.71 | 22 |
| Step Forecasting | COIN | Accuracy 56.2 | 22 |
| Procedure Planning | COIN T=3 (test) | SR 30.12 | 21 |
| Video Classification | COIN (test) | Top-1 Accuracy 94.1 | 20 |
| Keystep recognition | COIN (test) | Accuracy 16.9 | 18 |
| Task recognition | COIN | Accuracy 90.5 | 14 |
| Long-Term Video Understanding | COIN | Top-1 Acc 96 | 14 |
| Keystep recognition | COIN | Accuracy 57.2 | 14 |
| Video Question Answering | COIN | Accuracy 97.8 | 13 |
| Procedure Planning | COIN T=4 (test) | SR 22.24 | 13 |
| Next forecasting | COIN (test) | Top-1 Accuracy 54.1 | 13 |
| Action Recognition | COIN | Top-1 Acc 90.4 | 12 |
| Procedural Activities Classification | COIN | Accuracy 90 | 12 |
| Step recognition | COIN (test) | Top-1 Acc 66.4 | 11 |
| Symbolic Reasoning | Coin | Accuracy 100 | 11 |
| Instructional Video Understanding | COIN (test) | Step Recognition Top-1 Acc 63.4 | 10 |
| Task recognition | COIN (test) | Top-1 Acc 92.7 | 9 |
| Step localization | COIN | Accuracy 59.6 | 8 |
| Procedure Planning | COIN T=5 (test) | SR 16.06 | 8 |
| Visual Planners for human Assistance | COIN (test) | SR (T=3) 25.5 | 6 |