Improving Selective Visual Question Answering by Learning from Your Peers

About

Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out-of-the-box, can have difficulties abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.g., VQA assistants for users with visual impairments). For such scenarios, abstention can be especially important as users may provide out-of-distribution (OOD) or adversarial inputs that make incorrect answers more likely. In this work, we explore Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data. The goal is to maximize the number of questions answered while minimizing the risk of error on those questions. We propose a simple yet effective Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions. Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model. It does not require additional manual labels or held-out data and provides a signal for identifying examples that are easy/difficult to generalize to. In our extensive evaluations, we show this benefits a number of models across different architectures and scales. Overall, for ID, we reach 32.92% in the selective prediction metric coverage at 1% risk of error (C@1%) which doubles the previous best coverage of 15.79% on this task. For mixed ID/OOD, using models' softmax confidences for abstention decisions performs very poorly, answering <5% of questions at 1% risk of error even when faced with only 10% OOD examples, but a learned selection function with LYP can increase that to 25.38% C@1%.
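The coverage-at-risk metric quoted above (C@1%) can be illustrated with a minimal sketch. It is not the authors' evaluation code. The function name `coverage_at_risk` and its arguments are hypothetical, and the sketch assumes each example has a confidence score and a per-example accuracy in [0, 1], with risk defined as one minus the mean accuracy over the answered questions. Ties in confidence are treated as separable, which is a simplification.

```python
from typing import Sequence


def coverage_at_risk(
    confidences: Sequence[float],
    accuracies: Sequence[float],
    max_risk: float,
) -> float:
    """Largest fraction of questions that can be answered at risk <= max_risk.

    Hypothetical helper: questions are answered in order of decreasing
    confidence, and risk is 1 - mean accuracy over the answered questions.
    """
    if len(confidences) != len(accuracies):
        raise ValueError("confidences and accuracies must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Indices sorted from most to least confident.
    order = sorted(range(n), key=lambda i: confidences[i], reverse=True)

    best_answered = 0
    accuracy_sum = 0.0
    for answered, idx in enumerate(order, start=1):
        accuracy_sum += accuracies[idx]
        risk = 1.0 - accuracy_sum / answered
        # Keep the largest prefix whose risk stays within the budget.
        if risk <= max_risk:
            best_answered = answered
    return best_answered / n


# Toy data (illustrative only, not results from the paper).
toy_conf = [0.9, 0.8, 0.7, 0.6]
toy_acc = [1.0, 1.0, 0.0, 1.0]
strict = coverage_at_risk(toy_conf, toy_acc, max_risk=0.0)
relaxed = coverage_at_risk(toy_conf, toy_acc, max_risk=0.25)
```

On the toy data, a zero risk budget allows answering only the two most confident questions, while a budget of 0.25 allows answering all four. A better selection function raises this coverage by ranking correct answers above incorrect ones.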

Corentin Dancette, Spencer Whitehead, Rishabh Maheshwary, Ramakrishna Vedantam, Stefan Scherer, Xinlei Chen, Matthieu Cord, Marcus Rohrbach • 2023

Related benchmarks

Task	Dataset	Result
Visual Entailment	SNLI-VE (test)	Overall Accuracy 77.91	199
Image-Text Matching	Winoground	--	26
Classification	Pets	AURC 0.221	23
Classification	UCF101	AURC 0.226	23
Image-Text Matching	FOIL	AURC 0.225	23
Image-Text Matching	VL-Checklist	AURC 0.232	23
Image-Text Matching	What’sUp	AURC 22.8	23
Selective Visual Question Answering	Mixed ID/OOD 66.7% VQA v2 / 33.3% AdVQA (test)	Acc 67.79	17
Selective Visual Question Answering	Mixed ID/OOD 50% VQA v2 / 50% AdVQA (test)	Accuracy 63.01	17
Visual Question Answering	33.3% VQA v2 + 66.7% AdVQA (test)	Accuracy 57.71	17

Showing 10 of 17 rows
