Improving Selective Visual Question Answering by Learning from Your Peers

About

Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out-of-the-box, can have difficulties abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.g., VQA assistants for users with visual impairments). For such scenarios, abstention can be especially important as users may provide out-of-distribution (OOD) or adversarial inputs that make incorrect answers more likely. In this work, we explore Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data. The goal is to maximize the number of questions answered while minimizing the risk of error on those questions. We propose a simple yet effective Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions. Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model. It does not require additional manual labels or held-out data and provides a signal for identifying examples that are easy/difficult to generalize to. In our extensive evaluations, we show this benefits a number of models across different architectures and scales. Overall, for ID, we reach 32.92% in the selective prediction metric coverage at 1% risk of error (C@1%) which doubles the previous best coverage of 15.79% on this task. For mixed ID/OOD, using models' softmax confidences for abstention decisions performs very poorly, answering <5% of questions at 1% risk of error even when faced with only 10% OOD examples, but a learned selection function with LYP can increase that to 25.38% C@1%.
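The coverage-at-risk metric quoted above (C@1%) can be illustrated with a minimal sketch. It assumes the usual selective-prediction definition: questions are answered in order of decreasing confidence, risk is one minus the mean accuracy over the answered questions, and coverage at risk R is the largest answered fraction whose risk stays at or below R. The function name `coverage_at_risk` and its arguments are hypothetical and are not taken from the paper's code.

```python
def coverage_at_risk(confidences, accuracies, max_risk):
    """Largest fraction of questions answerable with risk <= max_risk.

    Assumed definition (not taken from the paper's code):
      - a confidence threshold selects which questions are answered,
      - risk = 1 - mean accuracy over the answered questions,
      - accuracies may be soft VQA scores in [0, 1].
    """
    if len(confidences) != len(accuracies):
        raise ValueError("confidences and accuracies must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Answer the most confident questions first.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best_coverage = 0.0
    total_accuracy = 0.0
    for k, idx in enumerate(order, start=1):
        total_accuracy += accuracies[idx]
        # A threshold cannot separate questions with equal confidence,
        # so only evaluate at the end of each group of ties.
        if k < n and confidences[order[k]] == confidences[idx]:
            continue
        risk = 1.0 - total_accuracy / k
        if risk <= max_risk + 1e-12:  # small tolerance for float rounding
            best_coverage = k / n
    return best_coverage


if __name__ == "__main__":
    # Toy data: five questions, the third most confident one is wrong.
    conf = [0.9, 0.8, 0.7, 0.6, 0.5]
    acc = [1.0, 1.0, 0.0, 1.0, 1.0]
    for r in (0.0, 0.1, 0.2):
        print(f"C@{r:.0%} = {coverage_at_risk(conf, acc, r):.2f}")
```

In this sketch the confidence scores could come either from a model's softmax output or from a learned selection function; the metric only depends on how well those scores rank correct answers above incorrect ones.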

Corentin Dancette, Spencer Whitehead, Rishabh Maheshwary, Ramakrishna Vedantam, Stefan Scherer, Xinlei Chen, Matthieu Cord, Marcus Rohrbach • 2023

Related benchmarks

Task                                | Dataset                                             | Result                 | Rank
Visual Entailment                   | SNLI-VE (test)                                      | Overall Accuracy 77.91 | 197
Image-Text Matching                 | Winoground                                          | --                     | 26
Classification                      | Pets                                                | AURC 0.221             | 23
Classification                      | UCF101                                              | AURC 0.226             | 23
Image-Text Matching                 | FOIL                                                | AURC 0.225             | 23
Image-Text Matching                 | VL-Checklist                                        | AURC 0.232             | 23
Image-Text Matching                 | What’sUp                                            | AURC 22.8              | 23
Selective Visual Question Answering | Mixed ID/OOD 66.7% VQA v2 / 33.3% AdVQA (test)      | Acc 67.79              | 17
Selective Visual Question Answering | Mixed ID/OOD 50% VQA v2 / 50% AdVQA (test)          | Accuracy 63.01         | 17
Visual Question Answering           | 33.3% VQA v2 + 66.7% AdVQA (test)                   | Accuracy 57.71         | 17