
VIGiA: Instructional Video Guidance via Dialogue Reasoning and Retrieval

About

We introduce VIGiA, a novel multimodal dialogue model designed to understand and reason over complex, multi-step instructional video action plans. Unlike prior work, which focuses mainly on text-only guidance or treats vision and language in isolation, VIGiA supports grounded, plan-aware dialogue that requires reasoning over visual inputs, instructional plans, and interleaved user interactions. To this end, VIGiA incorporates two key capabilities: (1) multimodal plan reasoning, enabling the model to align uni- and multimodal queries with the current task plan and respond accurately; and (2) plan-based retrieval, allowing it to retrieve relevant plan steps in either textual or visual representations. Experiments were conducted on a novel dataset of rich instructional video dialogues aligned with cooking and DIY plans. Our evaluation shows that VIGiA outperforms existing state-of-the-art models on all tasks in a conversational plan guidance setting, reaching over 90% accuracy on plan-aware VQA.

Diogo Glória-Silva, David Semedo, João Magalhães • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Plan-Grounded Answer Generation | InstructionVidDial (test) | ROUGE-L | 75.3 | 8 |
| Plan-grounded Visual Question Answering | InstructionVidDial (test) | ROUGE-L | 33.65 | 8 |
| Visual Slot Grounding | InstructionVidDial (test) | ROUGE-L | 55.66 | 8 |
| Dialogue-level Guidance Quality Evaluation | Dialogue-level evaluation (N=54) | State Tracking | 3.3 | 6 |
| Contextual Video-Moment Retrieval | InstructionVidDial (test) | Recall@1 (IoU=0.5) | 30.74 | 4 |
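
For readers unfamiliar with the Recall@1 (IoU=0.5) metric reported for video-moment retrieval, the sketch below shows how such a score is commonly computed: a query counts as a hit when the top-ranked predicted time span overlaps the ground-truth span with a temporal intersection-over-union of at least 0.5. This is a minimal illustration, not the authors' evaluation code. The function names, the (start, end) span representation in seconds, and the choice of an inclusive threshold are all assumptions made for the example.

```python
from typing import Sequence, Tuple

# Hypothetical representation: a video moment as (start_seconds, end_seconds).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gold: Span) -> float:
    """Intersection-over-union of two time intervals."""
    intersection = max(0.0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(
    top1_preds: Sequence[Span],
    golds: Sequence[Span],
    iou_threshold: float = 0.5,
) -> float:
    """Percentage of queries whose top-ranked span reaches the IoU threshold.

    Assumption: a span with IoU exactly equal to the threshold counts as a hit.
    """
    if len(top1_preds) != len(golds):
        raise ValueError("one top-1 prediction is required per ground-truth span")
    if not golds:
        return 0.0
    hits = sum(
        temporal_iou(pred, gold) >= iou_threshold
        for pred, gold in zip(top1_preds, golds)
    )
    return 100.0 * hits / len(golds)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    preds = [(0.0, 10.0), (5.0, 6.0), (20.0, 30.0)]
    golds = [(0.0, 8.0), (0.0, 10.0), (22.0, 30.0)]
    print(f"Recall@1 (IoU=0.5): {recall_at_1(preds, golds):.2f}")
```

Under this reading, the reported 30.74 would mean that roughly 31 of every 100 test queries had a top-ranked moment overlapping the reference moment by at least half.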