VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval
About
We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. We evaluate on a novel dataset of rich instructional video dialogues aligned with Cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan-guidance setting, reaching over 90% accuracy on plan-aware VQA.
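To illustrate the plan-based retrieval interface, here is a minimal, self-contained sketch: given a user query, it ranks the steps of the current plan by cosine similarity of simple bag-of-words vectors and returns the best match. This is purely illustrative; VIGiA itself retrieves over learned textual and visual representations, and the `retrieve_step` helper, the toy plan, and the bag-of-words scoring are our own stand-ins, not the model's actual machinery.

```python
# Toy sketch of plan-step retrieval: rank plan steps against a user query
# using bag-of-words cosine similarity (stand-in for learned embeddings).
from collections import Counter
import math

def bow(text: str) -> Counter:
    """Lowercased bag-of-words vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_step(query: str, plan_steps: list[str]) -> tuple[int, str]:
    """Return (index, text) of the plan step most similar to the query."""
    scores = [cosine(bow(query), bow(step)) for step in plan_steps]
    best = max(range(len(plan_steps)), key=scores.__getitem__)
    return best, plan_steps[best]

# Hypothetical cooking plan and in-task user question.
plan = [
    "Preheat the oven to 180C.",
    "Whisk the eggs with sugar until fluffy.",
    "Fold in the flour and pour into the tin.",
]
idx, step = retrieve_step("how long do I whisk the eggs", plan)
```

In the full model, the same interface also accepts visual queries (e.g. a frame of the user's workspace) and can return the step's video segment rather than its text.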
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Plan-Grounded Answer Generation | InstructionVidDial (test) | ROUGE-L | 75.3 | 8 |
| Plan-Grounded Visual Question Answering | InstructionVidDial (test) | ROUGE-L | 33.65 | 8 |
| Visual Slot Grounding | InstructionVidDial (test) | ROUGE-L | 55.66 | 8 |
| Dialogue-level Guidance Quality Evaluation | Dialogue-level evaluation (N=54) | State Tracking | 3.3 | 6 |
| Contextual Video-Moment Retrieval | InstructionVidDial (test) | Recall@1 (IoU=0.5) | 30.74 | 4 |