Visual Coreference Resolution in Visual Dialog using Neural Module Networks
About
Visual dialog entails answering a series of questions grounded in an image, using the dialog history as context. In addition to the challenges found in visual question answering (VQA), which can be seen as one-round dialog, visual dialog encompasses several more. We focus on one such problem, visual coreference resolution: determining which words, typically noun phrases and pronouns, co-refer to the same entity/object instance in an image. This is crucial, especially for pronouns (e.g., "it"), as the dialog agent must first link the pronoun to a previous coreference (e.g., "boat"), and only then can it rely on the visual grounding of "boat" to reason about "it". Prior work in visual dialog models visual coreference resolution either (a) implicitly via a memory network over history, or (b) at a coarse level for the entire question, but not explicitly at the granularity of individual phrases. In this work, we propose a neural module network architecture for visual dialog, introducing two novel modules, Refer and Exclude, that perform explicit, grounded coreference resolution at a finer word level. We demonstrate the effectiveness of our model on MNIST Dialog, a visually simple yet coreference-heavy dataset, where it achieves near-perfect accuracy, and on VisDial, a large and challenging visual dialog dataset of real images, where it outperforms other approaches and is qualitatively more interpretable, grounded, and consistent.
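To make the idea of an explicit, grounded Refer operation concrete, below is a minimal PyTorch sketch of one plausible realization: given the embedding of a referring phrase (e.g., "it"), the module attends over a memory of previously grounded phrases (e.g., "boat") and reuses their visual attention maps. The class name `ReferModule`, the linear scoring function, and all tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReferModule(nn.Module):
    """Illustrative sketch of a Refer-style module (not the paper's exact code).

    Resolves a referring phrase by soft-attending over a memory of
    previously seen phrases and combining their stored visual groundings.
    """

    def __init__(self, embed_dim: int):
        super().__init__()
        # Project the query phrase before dot-product scoring against memory
        # keys (a hypothetical choice; the paper's scoring may differ).
        self.key_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, phrase_emb, memory_keys, memory_maps):
        # phrase_emb:  (D,)       embedding of the phrase to resolve
        # memory_keys: (T, D)     embeddings of T previously grounded phrases
        # memory_maps: (T, H, W)  their stored visual attention maps
        scores = memory_keys @ self.key_proj(phrase_emb)   # (T,)
        weights = F.softmax(scores, dim=0)                 # (T,)
        # Resolved grounding: convex combination of past attention maps.
        return torch.einsum('t,thw->hw', weights, memory_maps)


# Toy usage: resolve "it" against a two-entry history memory.
refer = ReferModule(embed_dim=8)
it_emb = torch.randn(8)
keys = torch.randn(2, 8)       # embeddings for, e.g., "boat" and "dog"
maps = torch.rand(2, 7, 7)     # their stored 7x7 attention maps
resolved_map = refer(it_emb, keys, maps)   # (7, 7) attention map for "it"
```

An Exclude module could analogously suppress, rather than reuse, the groundings of entities already referred to; the sketch above only illustrates the reuse case.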
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Dialog | VisDial v0.9 (val) | MRR | 64.1 | 141 |
| Visual Dialog | VisDial v1.0 (test-std) | NDCG | 54.7 | 77 |
| Visual Dialog | VisDial v1.0 (val) | MRR | 0.61 | 65 |
| Visual Dialog | VisDial v0.9 (test) | MRR | 64.1 | 58 |
| Visual Dialog Retrieval | VisDial v1.0 (test-std) | MRR | 61.5 | 51 |
| Visual Dialog | MNIST Dialog (test) | Accuracy | 0.993 | 7 |