
Neural Module Networks

About

Visual question answering is fundamentally compositional in nature---a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning *neural module networks*, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
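The compositional idea above can be sketched in code: a question like "what color is the dog?" is parsed into a layout such as `describe[color](find[dog])`, and the corresponding modules are chained into one network. The sketch below is a toy illustration with random NumPy weights, not the paper's trained modules; the `find` and `describe` names follow the paper's module types, while the shapes and parameters are invented for demonstration.

```python
import numpy as np

# Toy setup: a 3x3 grid of image regions, each with a small feature vector.
# All weights here are random stand-ins for learned parameters.
rng = np.random.default_rng(0)
NUM_REGIONS = 9
FEAT_DIM = 8
image_feats = rng.normal(size=(NUM_REGIONS, FEAT_DIM))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def find(concept_vec):
    """Attention module: scores every region against a concept embedding,
    returning a normalized attention map over regions (e.g. find[dog])."""
    return softmax(image_feats @ concept_vec)

def describe(attention, classifier_W):
    """Answer module: pools attended region features and classifies the
    result into an answer distribution (e.g. describe[color])."""
    pooled = attention @ image_feats          # attention-weighted sum over regions
    return softmax(classifier_W @ pooled)     # distribution over candidate answers

# "what color is the dog?"  ->  describe[color](find[dog])
dog_vec = rng.normal(size=FEAT_DIM)           # stand-in for a learned "dog" embedding
color_W = rng.normal(size=(4, FEAT_DIM))      # 4 hypothetical color answers

attention = find(dog_vec)
answer_dist = describe(attention, color_W)
print(answer_dist.shape)                      # a length-4 answer distribution
```

Because `find[dog]` is a single reusable module, the same attention map feeds both "what color is the dog?" and "where is the dog?"; in the paper, the modules are trained jointly through whichever compound network each question instantiates.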

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein • 2015

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA (test-dev) | Acc (All) | 54.8 | 147 |
| Open-Ended Visual Question Answering | VQA 1.0 (test-dev) | Overall Accuracy | 58.6 | 100 |
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 72.1 | 61 |
| Open-Ended Visual Question Answering | VQA 1.0 (test-standard) | Overall Accuracy | 58.7 | 50 |
| Visual Question Answering | CLEVR 1.0 (test) | Overall Accuracy | 72.1 | 46 |
| Visual Question Answering | VQA 1.0 (test-dev) | Overall Accuracy | 58.6 | 44 |
| Open-Ended Visual Question Answering | VQA (test-standard) | Accuracy (Overall) | 58.7 | 32 |
| Visual Question Answering | VQA 1 (test-standard) | VQA Open-Ended Accuracy (All) | 58.7 | 28 |
| Visual Question Answering | VQA COCO 2015 v1.0 (test-dev) | Overall Accuracy | 54.8 | 16 |
| Visual Question Answering | TDIUC (test) | Accuracy | 79.56 | 6 |
