| Task Name | Dataset Name | SOTA Result | Trend |
|---|---|---|---|
| Visual Slot Grounding | InstructionVidDial (test) | ROUGE-L 55.66 | 8 |
| Plan-grounded Visual Question Answering | InstructionVidDial (test) | ROUGE-L 33.65 | 8 |
| Plan-Grounded Answer Generation | InstructionVidDial (test) | ROUGE-L 75.3 | 8 |
| Contextual Video-Moment Retrieval | InstructionVidDial (test) | Recall@1 (IoU=0.5) 30.74 | 4 |