
Learning to Assemble Neural Module Tree Networks for Visual Grounding

About

Visual grounding, the task of grounding (i.e., localizing) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion. In particular, we develop a novel modular network called Neural Module Tree network (NMTree) that regularizes visual grounding along the dependency parsing tree of the sentence: each node is a neural module that calculates visual attention according to its linguistic feature, and the grounding score is accumulated in a bottom-up direction as needed. NMTree disentangles visual grounding from composite reasoning, allowing the former to focus only on primitive and easy-to-generalize patterns. To reduce the impact of parsing errors, we train the modules and their assembly end-to-end using the Gumbel-Softmax approximation and its straight-through gradient estimator, which account for the discrete nature of module assembly. Overall, the proposed NMTree consistently outperforms state-of-the-art methods on several benchmarks. Qualitative results show the explainable grounding score calculation in detail.
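To make the bottom-up accumulation and the discrete module choice more concrete, here is a minimal, forward-only sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the `Node` class, the two-module set ("local" and "sum"), the per-region score lists, and the example logits are all hypothetical. Only the Gumbel-Softmax sampling step follows the standard definition.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return (soft, hard): a relaxed sample and its one-hot argmax."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]  # numerically stable softmax
    total = sum(exps)
    soft = [e / total for e in exps]
    best = soft.index(max(soft))
    hard = [1.0 if i == best else 0.0 for i in range(len(soft))]
    return soft, hard


class Node:
    """Hypothetical dependency-tree node with per-region local scores."""

    def __init__(self, word, module_logits, local_scores, children=()):
        self.word = word
        self.module_logits = module_logits  # [logit for "local", logit for "sum"]
        self.local_scores = local_scores    # one score per candidate region
        self.children = list(children)


def ground(node, rng=random):
    """Accumulate grounding scores bottom-up over the tree."""
    child_sum = [0.0] * len(node.local_scores)
    for child in node.children:
        child_scores = ground(child, rng)
        child_sum = [a + b for a, b in zip(child_sum, child_scores)]
    soft, hard = gumbel_softmax(node.module_logits, rng=rng)
    # Forward pass uses the one-hot choice. In an autodiff framework the
    # straight-through estimator would route gradients through `soft`,
    # typically written as: hard - stop_gradient(soft) + soft.
    use_local, use_sum = hard
    return [
        use_local * own + use_sum * (own + acc)
        for own, acc in zip(node.local_scores, child_sum)
    ]


# Toy expression "red ball" over two candidate regions (made-up numbers).
leaf = Node("red", [1000.0, -1000.0], [0.1, 0.9])
root = Node("ball", [-1000.0, 1000.0], [0.5, 0.5], children=[leaf])
scores = ground(root, random.Random(0))
print("best region index:", scores.index(max(scores)))
```

The extreme logits in the toy example make the module choice effectively deterministic, so the run is reproducible; with comparable logits the Gumbel noise would make the assembly stochastic, which is the exploration the relaxation is meant to allow during training.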

Daqing Liu, Hanwang Zhang, Feng Wu, Zheng-Jun Zha • 2018

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 66.46 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 76.41 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 0.8121 | 342 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 61.46 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy 61.01 | 300 |
| Referring Image Segmentation | RefCOCO (val) | mIoU 56.59 | 259 |
| Referring Expression Segmentation | RefCOCO (testA) | -- | 257 |
| Referring Image Segmentation | RefCOCO+ (test-B) | mIoU 41.56 | 252 |
| Referring Expression Comprehension | RefCOCO+ (testB) | Accuracy 57.52 | 244 |
| Referring Image Segmentation | RefCOCO (test A) | mIoU 63.02 | 230 |
Showing 10 of 49 rows
