Learning to Assemble Neural Module Tree Networks for Visual Grounding
About
Visual grounding, the task of grounding (i.e., localizing) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse subject-predicate-object triplet composition. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion, as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTree), which regularizes visual grounding along the dependency parsing tree of the sentence: each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed. NMTree disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, accounting for the discrete nature of module assembly. Overall, the proposed NMTree consistently outperforms the state of the art on several benchmarks. Qualitative results show the explainable grounding score calculation in great detail.
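The Gumbel-Softmax trick mentioned above lets a network sample a discrete choice (here, which module to assemble at a tree node) while still passing gradients through the soft relaxation. A minimal NumPy sketch of the sampling step is below; the function name and shapes are illustrative, not the paper's actual implementation, and the straight-through gradient substitution is only indicated in a comment since NumPy has no autograd:

```python
import numpy as np

def gumbel_softmax_sample(logits, tau=1.0, rng=None):
    """Sample a hard one-hot choice via the Gumbel-Softmax relaxation.

    Returns (y_hard, y_soft): the discrete one-hot sample used in the
    forward pass, and the soft distribution whose gradient the
    straight-through estimator would substitute in the backward pass.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    u = rng.uniform(1e-9, 1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))
    # Temperature-scaled softmax over perturbed logits.
    z = (logits + gumbel) / tau
    z = z - z.max()                      # numerical stability
    y_soft = np.exp(z) / np.exp(z).sum()
    # Hard one-hot sample (argmax of the relaxed distribution).
    y_hard = np.eye(logits.shape[0])[np.argmax(y_soft)]
    # Straight-through trick (in an autograd framework, e.g. PyTorch):
    #   y = (y_hard - y_soft).detach() + y_soft
    # so the forward pass is discrete but gradients flow through y_soft.
    return y_hard, y_soft
```

Lowering the temperature `tau` makes the soft distribution approach the hard one-hot sample, which is why it is typically annealed during training.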
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 66.46 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 76.41 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 81.21 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 61.46 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 61.01 | 300 |
| Referring Image Segmentation | RefCOCO (val) | mIoU | 56.59 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | -- | 257 |
| Referring Image Segmentation | RefCOCO+ (testB) | mIoU | 41.56 | 252 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 57.52 | 244 |
| Referring Image Segmentation | RefCOCO (testA) | mIoU | 63.02 | 230 |