Visual Entailment Task for Visually-Grounded Language Learning
About
We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) in that the premise is an image rather than a natural language sentence. We propose a novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), built for the VE task from the Stanford Natural Language Inference (SNLI) corpus and Flickr30k. We also introduce a differentiable architecture, the Explainable Visual Entailment model (EVE), to tackle the VE problem. EVE and several state-of-the-art visual question answering (VQA) based models are evaluated on SNLI-VE, facilitating grounded language understanding and providing insight into how modern VQA-based models perform on this task.
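As a rough illustration of the task setup, the sketch below pairs SNLI-VE hypotheses with their Flickr30k premise images and the three-way label (entailment / neutral / contradiction). The JSONL field names (`Flickr30K_ID`, `sentence2`, `gold_label`) and file names are assumptions about the public release; consult the repository linked above for the exact schema.

```python
import json
import os

def load_snli_ve(jsonl_path, flickr30k_image_dir):
    """Yield (image_path, hypothesis, label) triples for the VE task.

    Field names below are assumed from the SNLI-VE release and may differ;
    verify against the files in https://github.com/necla-ml/SNLI-VE.
    """
    with open(jsonl_path, "r", encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            label = example["gold_label"]      # entailment / neutral / contradiction
            if label == "-":                   # skip examples without a consensus label
                continue
            # The premise is the Flickr30k image referenced by the example.
            image_path = os.path.join(
                flickr30k_image_dir, f"{example['Flickr30K_ID']}.jpg"
            )
            # The hypothesis is the natural-language sentence to be classified.
            hypothesis = example["sentence2"]
            yield image_path, hypothesis, label

# Hypothetical usage (paths are placeholders):
# for img, hyp, lbl in load_snli_ve("snli_ve_dev.jsonl", "flickr30k_images/"):
#     print(img, hyp, lbl)
```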
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Entailment | SNLI-VE (test) | Overall Accuracy | 71.16 | 197 |
| Visual Entailment | SNLI-VE (val) | Overall Accuracy | 71.56 | 109 |