
Inferring and Executing Programs for Visual Reasoning

About

Existing methods for visual reasoning attempt to directly map inputs to outputs using black-box architectures without explicitly modeling the underlying reasoning processes. As a result, these black-box models often learn to exploit biases in the data rather than learning to perform visual reasoning. Inspired by module networks, this paper proposes a model for visual reasoning that consists of a program generator that constructs an explicit representation of the reasoning process to be performed, and an execution engine that executes the resulting program to produce an answer. Both the program generator and the execution engine are implemented by neural networks, and are trained using a combination of backpropagation and REINFORCE. Using the CLEVR benchmark for visual reasoning, we show that our model significantly outperforms strong baselines and generalizes better in a variety of settings.
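To make the two-stage design concrete, here is a toy, purely symbolic sketch of the program-generator / execution-engine split the abstract describes. In the paper both components are neural networks (a sequence model that emits a program, and a module network that executes it over CNN image features); in this illustration the "program" is just a list of module names, the scene is a hand-written list of objects, and the generator is a lookup table. All names (`SCENE`, `filter_color`, `execute`, the example question) are hypothetical, not the paper's API.

```python
# Toy sketch of "infer a program, then execute it" for visual reasoning.
# The real model learns both stages with backpropagation and REINFORCE;
# here everything is hand-written purely to show the control flow.

# A symbolic stand-in for an image: a list of objects with attributes.
SCENE = [
    {"shape": "cube",   "color": "red"},
    {"shape": "sphere", "color": "red"},
    {"shape": "cube",   "color": "blue"},
]

# Each module maps a set of objects to a new value (objects or a count).
def filter_color(color):
    return lambda objs: [o for o in objs if o["color"] == color]

def filter_shape(shape):
    return lambda objs: [o for o in objs if o["shape"] == shape]

def count(objs):
    return len(objs)

MODULES = {
    "filter_color[red]":  filter_color("red"),
    "filter_shape[cube]": filter_shape("cube"),
    "count":              count,
}

def generate_program(question):
    """Stand-in for the neural program generator: a lookup table
    instead of a learned sequence-to-sequence model."""
    table = {
        "How many red cubes are there?":
            ["filter_color[red]", "filter_shape[cube]", "count"],
    }
    return table[question]

def execute(program, scene):
    """The execution engine applies the program's modules in order,
    threading the intermediate result through each one."""
    state = scene
    for name in program:
        state = MODULES[name](state)
    return state

program = generate_program("How many red cubes are there?")
print(execute(program, SCENE))  # -> 1 (one red cube in SCENE)
```

The point of the split is that the reasoning steps become an explicit, inspectable program rather than being entangled inside a single black-box network, which is what the paper credits for the improved generalization on CLEVR.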

Justin Johnson, Bharath Hariharan, Laurens van der Maaten, Judy Hoffman, Li Fei-Fei, C. Lawrence Zitnick, Ross Girshick • 2017

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 96.9 | 61 |
| Visual Question Answering | CLEVR 1.0 (test) | Overall Accuracy | 96.9 | 46 |
| Visual Question Answering | CLEVR-Humans | Accuracy | 66.6 | 24 |
| Visual Question Answering | CLEVR-Humans 1.0 (test) | Accuracy | 66.6 | 22 |
| Visual Question Answering | CLEVR-CoGenT (Condition A) | Accuracy | 96.6 | 21 |
| Visual Question Answering | CLEVR-CoGenT (Condition B) | Accuracy | 92.7 | 18 |
| Visual Question Answering | CLEVR-Humans (test) | Accuracy | 66.6 | 17 |
| Visual Question Answering | CLEVR-CoGenT (val) | Accuracy | 96.6 | 12 |
| Visual Reasoning | CLEVR 1.0 (test) | Overall Accuracy | 96.9 | 11 |
| Visual Question Answering | CLEVR pixels (test) | Overall Accuracy | 76.6 | 7 |

(10 of 12 rows shown)
