
Selective "Selective Prediction": Reducing Unnecessary Abstention in Vision-Language Reasoning

About

Selective prediction minimizes incorrect predictions from vision-language models (VLMs) by allowing them to abstain from answering when uncertain. However, when deploying a vision-language system with low tolerance for inaccurate predictions, selective prediction may be over-cautious and abstain too frequently, even on many correct predictions. We introduce ReCoVERR, an inference-time algorithm that reduces the over-abstention of a selective vision-language system without increasing the error rate of the system's predictions. When the VLM makes a low-confidence prediction, instead of abstaining, ReCoVERR tries to find relevant clues in the image that provide additional evidence for the prediction. ReCoVERR uses an LLM to pose related questions to the VLM and collects high-confidence pieces of evidence; if enough evidence confirms the prediction, the system answers instead of abstaining. ReCoVERR enables three VLMs (BLIP2, InstructBLIP, and LLaVA-1.5) to answer up to 20% more questions on the VQAv2 and A-OKVQA tasks without decreasing system accuracy, thus improving overall system reliability. Our code is available at https://github.com/tejas1995/ReCoVERR.
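The inference loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the interfaces `vlm_answer`, `llm_related_questions`, and `entails`, and the thresholds `tau_answer`, `tau_clue`, and `min_clues`, are all hypothetical placeholders standing in for the paper's actual components.

```python
from typing import Callable, List, Optional, Tuple

def recoverr_style_predict(
    image: object,
    question: str,
    vlm_answer: Callable[[object, str], Tuple[str, float]],  # -> (answer, confidence)
    llm_related_questions: Callable[[str, str], List[str]],  # LLM proposes follow-up questions
    entails: Callable[[List[str], str], bool],               # do the clues support the answer?
    tau_answer: float = 0.8,  # confidence above which the system answers directly
    tau_clue: float = 0.9,    # confidence required to keep a clue as evidence
    min_clues: int = 2,       # amount of evidence needed to override abstention
) -> Optional[str]:
    """Answer if confident; otherwise gather evidence before abstaining."""
    answer, conf = vlm_answer(image, question)
    if conf >= tau_answer:
        # Ordinary selective prediction: confident enough to answer outright.
        return answer

    # Low confidence: instead of abstaining immediately, collect
    # high-confidence clues about the image from related questions.
    clues: List[str] = []
    for q in llm_related_questions(question, answer):
        clue, clue_conf = vlm_answer(image, q)
        if clue_conf >= tau_clue:
            clues.append(f"{q} -> {clue}")

    # Answer only if enough evidence confirms the original prediction.
    if len(clues) >= min_clues and entails(clues, answer):
        return answer
    return None  # abstain
```

The key design point is that evidence can only convert an abstention into an answer, never overturn a confident answer, which is how the method adds coverage without raising the error rate.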

Tejas Srinivasan, Jack Hessel, Tanmay Gupta, Bill Yuchen Lin, Yejin Choi, Jesse Thomason, Khyathi Raghavi Chandu • 2024

Related benchmarks

Task                | Dataset      | Result               | Rank
Image-Text Matching | Winoground   | --                   | 26
Classification      | Flowers      | AURC 22.5            | 23
Classification      | Pets         | AURC 0.216           | 23
Classification      | UCF101       | AURC 0.223           | 23
Image-Text Matching | What’sUp     | AURC 24.9            | 23
Image-Text Matching | VL-Checklist | AURC 0.247           | 23
Image-Text Matching | FOIL         | AURC 0.242           | 23
Image-Text Matching | SugarCrepe   | AURC 16.7            | 17
Captioning          | Flickr 30k   | AURC (CIDEr-N) 0.253 | 15
Captioning          | MS-COCO      | AURC (CIDEr-N) 0.142 | 15
