
Neural Module Networks

About

Visual question answering is fundamentally compositional in nature---a question like "where is the dog?" shares substructure with questions like "what color is the dog?" and "where is the cat?" This paper seeks to simultaneously exploit the representational capacity of deep networks and the compositional linguistic structure of questions. We describe a procedure for constructing and learning *neural module networks*, which compose collections of jointly-trained neural "modules" into deep networks for question answering. Our approach decomposes questions into their linguistic substructures, and uses these structures to dynamically instantiate modular networks (with reusable components for recognizing dogs, classifying colors, etc.). The resulting compound networks are jointly trained. We evaluate our approach on two challenging datasets for visual question answering, achieving state-of-the-art results on both the VQA natural image dataset and a new dataset of complex questions about abstract shapes.
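The compositional idea above can be sketched in code: a question like "what color is the dog?" is parsed into a layout such as `describe[color](find[dog])`, and the corresponding modules are chained into one network. The sketch below is a toy illustration with random NumPy weights, not the paper's trained modules; the `find` and `describe` names follow the paper's module types, while the shapes and parameters are invented for demonstration.

```python
import numpy as np

# Toy setup: a 3x3 grid of image regions, each with a small feature vector.
# All weights here are random stand-ins for learned parameters.
rng = np.random.default_rng(0)
NUM_REGIONS = 9
FEAT_DIM = 8
image_feats = rng.normal(size=(NUM_REGIONS, FEAT_DIM))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def find(concept_vec):
    """Attention module: scores every region against a concept embedding,
    returning a normalized attention map over regions (e.g. find[dog])."""
    return softmax(image_feats @ concept_vec)

def describe(attention, classifier_W):
    """Answer module: pools attended region features and classifies the
    result into an answer distribution (e.g. describe[color])."""
    pooled = attention @ image_feats          # attention-weighted sum over regions
    return softmax(classifier_W @ pooled)     # distribution over candidate answers

# "what color is the dog?"  ->  describe[color](find[dog])
dog_vec = rng.normal(size=FEAT_DIM)           # stand-in for a learned "dog" embedding
color_W = rng.normal(size=(4, FEAT_DIM))      # 4 hypothetical color answers

attention = find(dog_vec)
answer_dist = describe(attention, color_W)
print(answer_dist.shape)                      # a length-4 answer distribution
```

Because `find[dog]` is a single reusable module, the same attention map feeds both "what color is the dog?" and "where is the dog?"; in the paper, the modules are trained jointly through whichever compound network each question instantiates.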

Jacob Andreas, Marcus Rohrbach, Trevor Darrell, Dan Klein • 2015

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA (test-dev) | Acc (All) | 54.8 | 147 |
| Open-Ended Visual Question Answering | VQA 1.0 (test-dev) | Overall Accuracy | 58.6 | 100 |
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 72.1 | 61 |
| Open-Ended Visual Question Answering | VQA 1.0 (test-standard) | Overall Accuracy | 58.7 | 50 |
| Visual Question Answering | CLEVR 1.0 (test) | Overall Accuracy | 72.1 | 46 |
| Visual Question Answering | VQA 1.0 (test-dev) | Overall Accuracy | 58.6 | 44 |
| Open-Ended Visual Question Answering | VQA (test-standard) | Accuracy (Overall) | 58.7 | 32 |
| Visual Question Answering | VQA 1 (test-standard) | VQA Open-Ended Accuracy (All) | 58.7 | 28 |
| Visual Question Answering | VQA COCO 2015 v1.0 (test-dev) | Overall Accuracy | 54.8 | 16 |
| Visual Question Answering | TDIUC (test) | Accuracy | 79.56 | 6 |
