Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction
About
We tackle the image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on the question. For the adaptive parameter prediction, we employ a separate parameter prediction network, which consists of a gated recurrent unit (GRU) that takes a question as its input and a fully-connected layer that generates a set of candidate weights as its output. However, it is challenging to construct a parameter prediction network for the large number of parameters in the fully-connected dynamic parameter layer of the CNN. We reduce the complexity of this problem by incorporating a hashing technique, in which a predefined hash function selects, for each individual weight in the dynamic parameter layer, one of the candidate weights given by the parameter prediction network. The proposed network, a joint network combining the CNN for ImageQA with the parameter prediction network, is trained end-to-end through back-propagation, with its weights initialized from a pre-trained CNN and GRU. The proposed algorithm achieves state-of-the-art performance on all available public ImageQA benchmarks.
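The hashing step can be illustrated with a minimal sketch. It assumes that each position in the dynamic layer's weight matrix is mapped to one index in a small candidate vector, so many positions share the same candidate. The function name `build_dynamic_weights`, the seeded random index table standing in for the predefined hash function, and all sizes are hypothetical choices for illustration and are not taken from the paper.

```python
import numpy as np

def build_dynamic_weights(candidates, out_dim, in_dim, seed=0):
    """Fill an (out_dim, in_dim) matrix by mapping each position to a candidate weight.

    Assumption: a fixed, seeded random index table stands in for the
    predefined hash function; the actual hash is not specified here.
    """
    rng = np.random.default_rng(seed)
    # One candidate index per weight position, fixed once and reused.
    index = rng.integers(0, candidates.shape[0], size=(out_dim, in_dim))
    # Fancy indexing copies the selected candidate into every position.
    return candidates[index], index

# Stand-in for the output of the parameter prediction network (8 candidates).
candidates = np.linspace(-1.0, 1.0, 8)

# 8 predicted values populate a 4 x 6 layer with 24 weights.
weights, index = build_dynamic_weights(candidates, out_dim=4, in_dim=6)

# Apply the dynamic layer to a hypothetical image feature vector.
features = np.ones(6)
scores = weights @ features
```

Under these assumptions the prediction network only has to output as many values as there are candidates, rather than one value per weight in the layer, which is the reduction in complexity described above.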
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA (test-dev) | Acc (All) | 62.48 | 147 |
| Visual Question Answering | VQA (test-std) | -- | -- | 110 |
| Open-Ended Visual Question Answering | VQA 1.0 (test-dev) | Overall Accuracy | 57.22 | 100 |
| Visual Question Answering (Multiple-choice) | VQA 1.0 (test-dev) | Accuracy (All) | 62.5 | 66 |
| Visual Question Answering | COCO-QA (test) | WUPS (IoU=0.9) | 70.84 | 51 |
| Open-Ended Visual Question Answering | VQA 1.0 (test-standard) | Overall Accuracy | 57.4 | 50 |
| Image Question Answering | DAQUAR REDUCED (test) | Accuracy | 44.48 | 33 |
| Image Retrieval | Fashion200k (test) | Recall@1 | 12.2 | 32 |
| Open-Ended Visual Question Answering | VQA (test-standard) | Accuracy (Overall) | 57.4 | 32 |
| Visual Question Answering | VQA 1 (test-standard) | VQA Open-Ended Accuracy (All) | 57.36 | 28 |