| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Symbolic Reasoning | Coin | Accuracy 100 | 45 |
| Procedure Planning | COIN T=3 (test) | SR 30.12 | 40 |
| Video Action Classification | COIN | Top-1 Acc 95.3 | 33 |
| Action Phase Classification | COIN | Phase Acc 54.1 | 32 |
| Action segmentation | COIN | Frame Accuracy 70.02 | 29 |
| Step Forecasting | COIN | Accuracy 56.2 | 26 |
| Classification of Procedural Activities | COIN (test) | Accuracy 90.81 | 23 |
| Action Segmentation | COIN (test) | Frame Accuracy 72.8 | 23 |
| Visual Planning | COIN | Success Rate (SR) 33.99 | 22 |
| Task recognition | COIN | Accuracy 94.5 | 22 |
| Continual Multimodal Instruction Tuning | CoIN ScienceQA TextVQA ImageNet GQA VizWiz Grounding Chameleon backbone | Accuracy 68.71 | 22 |
| Procedure Planning | COIN T=4 (test) | SR 31.56 | 21 |
| Continual Learning | CoIN | Backward Transfer (BWT) -4.67 | 20 |
| Video Classification | COIN (test) | Top-1 Accuracy 94.1 | 20 |
| Keystep recognition | COIN (test) | Accuracy 16.9 | 18 |
| Long-Term Video Understanding | COIN | Top-1 Acc 96 | 14 |
| Keystep recognition | COIN | Accuracy 57.2 | 14 |
| Consistent Video Retrieval | COIN (test) | Accuracy 51.64 | 13 |
| Video Question Answering | COIN | Accuracy 97.8 | 13 |
| Next forecasting | COIN (test) | Top-1 Accuracy 54.1 | 13 |
| Step recognition | COIN | Top-1 Accuracy 67.3 | 12 |
| Action Recognition | COIN | Top-1 Acc 90.4 | 12 |
| Procedural Activities Classification | COIN | Accuracy 90 | 12 |
| Step recognition | COIN (test) | Top-1 Acc 66.4 | 11 |
| Instructional Video Understanding | COIN (test) | Step Recognition Top-1 Acc 63.4 | 10 |