Learning to Assemble Neural Module Tree Networks for Visual Grounding

About

Visual grounding, a task to ground (i.e., localize) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion, as it should be. In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes visual grounding along the dependency parsing tree of the sentence, where each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction, as needed. NMTree disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, accounting for the discrete nature of module assembly. Overall, the proposed NMTree consistently outperforms the state of the art on several benchmarks. Qualitative results show explainable grounding score calculations in great detail.

Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha • 2018
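To make the two key ideas of the abstract concrete (bottom-up score accumulation along the parse tree, and straight-through Gumbel-Softmax for the discrete module assembly), here is a minimal, hypothetical PyTorch sketch. `Node`, `NMTreeSketch`, the two-way `gate`, and the `word_to_vis` projection are illustrative inventions, not the paper's actual modules; the real NMTree distinguishes several node/module types and computes scores differently.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Node:
    """A dependency-parse node: a word embedding plus child nodes."""
    def __init__(self, word_emb, children=()):
        self.word_emb = word_emb       # tensor of shape (word_dim,)
        self.children = list(children)


class NMTreeSketch(nn.Module):
    """Toy bottom-up grounding along a parse tree.

    Each node attends over candidate-region features using its word
    embedding; a straight-through Gumbel-Softmax gate decides how the
    node combines its own attention with the scores accumulated from
    its children.
    """
    def __init__(self, word_dim, vis_dim, tau=1.0):
        super().__init__()
        self.word_to_vis = nn.Linear(word_dim, vis_dim)  # word -> visual query
        self.gate = nn.Linear(word_dim, 2)               # logits over 2 module choices
        self.tau = tau

    def forward(self, node, vis_feats):
        # vis_feats: (n_regions, vis_dim). Recurse bottom-up over children.
        child_score = torch.zeros(vis_feats.size(0))
        for child in node.children:
            child_score = child_score + self.forward(child, vis_feats)
        # Visual attention over regions from this node's linguistic feature.
        query = self.word_to_vis(node.word_emb)          # (vis_dim,)
        att = F.softmax(vis_feats @ query, dim=0)        # (n_regions,)
        # Discrete module choice relaxed with the straight-through
        # Gumbel-Softmax: the forward pass uses a hard one-hot sample,
        # while gradients flow through the soft relaxation.
        choice = F.gumbel_softmax(self.gate(node.word_emb), tau=self.tau, hard=True)
        # choice[0]: keep only this node's attention;
        # choice[1]: also accumulate the children's grounding scores.
        return choice[0] * att + choice[1] * (att + child_score)


# Usage: score 5 candidate regions for a two-word tree.
torch.manual_seed(0)
word_dim, vis_dim, n_regions = 8, 16, 5
tree = Node(torch.randn(word_dim), [Node(torch.randn(word_dim))])
model = NMTreeSketch(word_dim, vis_dim)
scores = model(tree, torch.randn(n_regions, vis_dim))
print(scores)  # per-region scores; the argmax is the grounded region
```

The `hard=True` flag gives the forward pass a discrete one-hot choice while the backward pass uses the soft probabilities, which is the straight-through estimator the abstract describes for training module assembly end-to-end despite its discrete nature.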

Related benchmarks

| Task | Dataset | Metric | Result (%) | Rank |
|---|---|---|---|---|
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy | 66.46 | 345 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 76.41 | 335 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 81.21 | 333 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 61.46 | 291 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 61.01 | 291 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy | 57.52 | 235 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | -- | 217 |
| Referring Expression Comprehension | RefCOCO+ (testA) | Accuracy | 72.02 | 207 |
| Referring Expression Segmentation | RefCOCO+ (val) | -- | -- | 201 |
| Referring Image Segmentation | RefCOCO+ (testB) | mIoU | 41.56 | 200 |
Showing 10 of 49 rows
