
Fusion of Detected Objects in Text for Visual Question Answering

About

To advance models of multimodal context, we introduce a simple yet powerful neural architecture for data that combines vision and natural language. The "Bounding Boxes in Text Transformer" (B2T2) also leverages referential information binding words to portions of the image in a single unified architecture. B2T2 is highly effective on the Visual Commonsense Reasoning benchmark (https://visualcommonsense.com), achieving a new state-of-the-art with a 25% relative reduction in error rate compared to published baselines and obtaining the best performance to date on the public leaderboard (as of May 22, 2019). A detailed ablation analysis shows that the early integration of the visual features into the text analysis is key to the effectiveness of the new architecture. A reference implementation of our models is provided (https://github.com/google-research/language/tree/master/language/question_answering/b2t2).
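The "early integration" the ablation highlights means that projected visual features for detected objects enter the model at the input-embedding level, alongside the word embeddings of the tokens that refer to them, rather than being merged after separate text and image encoders. A minimal sketch of that idea (hypothetical shapes and a toy linear projection; the actual B2T2 model uses ResNet features fed into a pretrained BERT, see the reference implementation linked above):

```python
import numpy as np

def early_fuse(token_embeddings, box_features, box_to_token, W, b):
    """Add projected bounding-box features to the embeddings of the tokens
    that reference them. This is a simplified illustration of early fusion,
    not the authors' implementation: the real model extracts box features
    with a ResNet and fuses them into BERT input embeddings.
    """
    fused = token_embeddings.copy()
    for box_idx, tok_idx in box_to_token:
        # Project the visual feature into the text embedding space
        # and add it at the referring token's position.
        fused[tok_idx] += box_features[box_idx] @ W + b
    return fused

# Toy example: 4 tokens of dim 8, 2 detected boxes with 16-dim features.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
boxes = rng.normal(size=(2, 16))
W = rng.normal(size=(16, 8)) * 0.1  # hypothetical learned projection
b = np.zeros(8)

# Box 0 grounds token 1; box 1 grounds token 3.
fused = early_fuse(tokens, boxes, [(0, 1), (1, 3)], W, b)
print(fused.shape)  # (4, 8)
```

Tokens without a grounded box keep their original embeddings, so the fused sequence can be consumed by a standard transformer encoder unchanged.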

Chris Alberti, Jeffrey Ling, Michael Collins, David Reitter • 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Commonsense Reasoning | VCR (val) | Accuracy | 76 | 63 |
| Visual Commonsense Reasoning | VCR (test) | Accuracy | 75.7 | 54 |
| Predicting object attributes | PIGPEN-NLU (val) | Accuracy | 61.3 | 12 |
| Predicting object attributes | PIGPEN-NLU seen (test) | Accuracy | 71.4 | 12 |
| Predicting object attributes | PIGPEN-NLU overall (test) | Accuracy | 53.9 | 12 |
| Predicting object attributes | PIGPEN-NLU unseen (test) | Accuracy | 48.1 | 12 |
| Rationale Selection (QA -> R) | VCR (val) | Accuracy | 76 | 8 |
| Holistic Reasoning (Q -> AR) | VCR (val) | Accuracy | 54.9 | 8 |
| Visual Question Answering (Q -> A) | VCR (val) | Accuracy | 71.9 | 8 |
| Visual Question Answering (Q -> A) | VCR (test) | Accuracy | 72.6 | 6 |
