MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

About

This paper focuses on visual counting, which aims to predict the number of occurrences given a natural image and a query (e.g., a question or a category). Unlike most prior works, which use explicit, symbolic models that can be computationally expensive and limited in generalization, we propose a simple and effective alternative by revisiting modulated convolutions that fuse the query and the image locally. Following the design of the residual bottleneck, we call our method MoVie, short for Modulated conVolutional bottlenecks. Notably, MoVie reasons implicitly and holistically and needs only a single forward pass during inference. Nevertheless, MoVie shows strong counting performance: 1) advancing the state of the art on counting-specific VQA tasks while being more efficient; 2) outperforming prior art on difficult benchmarks like COCO for common object counting; 3) helping us secure first place in the 2020 VQA challenge when integrated as a module for 'number'-related questions in generic VQA models. Finally, we show evidence that modulated convolutions such as MoVie can serve as a general mechanism for reasoning tasks beyond counting.
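To make the idea of fusing the query and the image locally more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. It assumes a FiLM-style channel-wise scaling in which the query is projected to one value per channel, and it simplifies the bottleneck to two 1x1 convolutions with a residual connection. All names (`modulate`, `modulated_bottleneck`, `w_query`, `w_reduce`, `w_expand`) are hypothetical and chosen for illustration only.

```python
# Illustrative sketch only: a FiLM-style modulated residual bottleneck.
# Assumption: the query scales each channel by (1 + delta), delta = W_q * query.
# Feature maps are nested lists shaped [channels][height][width].

def linear(weights, vec):
    """Bias-free dense layer: weights is [out][in], vec is [in]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]

def conv1x1(weights, fmap):
    """1x1 convolution: the same linear map applied at every location."""
    channels, height, width = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in weights]
    for y in range(height):
        for x in range(width):
            pixel = [fmap[c][y][x] for c in range(channels)]
            for o, value in enumerate(linear(weights, pixel)):
                out[o][y][x] = value
    return out

def relu(fmap):
    return [[[max(0.0, v) for v in row] for row in ch] for ch in fmap]

def add(a, b):
    """Element-wise sum of two feature maps with identical shapes."""
    return [[[x + y for x, y in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]

def modulate(fmap, query, w_query):
    """Scale every channel by (1 + delta); the same scale is used at
    each spatial location, so a zero query leaves the map unchanged."""
    delta = linear(w_query, query)
    return [[[v * (1.0 + d) for v in row] for row in ch]
            for ch, d in zip(fmap, delta)]

def modulated_bottleneck(fmap, query, params):
    """Modulate, reduce channels, expand back, then add the input."""
    hidden = modulate(fmap, query, params["w_query"])
    hidden = relu(conv1x1(params["w_reduce"], hidden))
    hidden = conv1x1(params["w_expand"], hidden)
    return relu(add(fmap, hidden))

# Toy usage: a 2-channel 2x2 feature map and a 2-dimensional query vector.
fmap = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, -1.0], [2.0, 0.0]]]
params = {
    "w_query": [[1.0, 0.0], [0.0, 2.0]],   # query dim 2 -> 2 channels
    "w_reduce": [[0.5, 0.5]],              # 2 channels -> 1 channel
    "w_expand": [[1.0], [-1.0]],           # 1 channel -> 2 channels
}
out = modulated_bottleneck(fmap, [1.0, 0.5], params)
print(len(out), len(out[0]), len(out[0][0]))  # output keeps the input shape
```

Because the block is residual and keeps the input shape, several such blocks could be stacked in a single forward pass, which is in keeping with the single-pass inference described above. A real bottleneck would normally also include a spatial (e.g., 3x3) convolution and normalization, both omitted here for brevity.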

Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen • 2020

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy	69.26	721
Visual Question Answering	VQA 2.0 (test-dev)	Accuracy	69.26	337
Visual Question Answering	GQA (test)	Accuracy	57.1	204
Visual Question Answering	GQA (test-std)	Accuracy	57.1	74
Visual Question Answering	CLEVR (test)	Overall Accuracy	97.42	61
Object Counting	Pascal VOC (test)	RMSE	0.36	27
Object Counting	COCO (test)	RMSE	0.3	16
Open-ended counting	TallyQA (test)	Simple Accuracy	74.9	14
Visual Question Answering	TallyQA complex	Accuracy	56.8	13
Visual Question Answering	TallyQA simple	Accuracy	74.9	10

Showing 10 of 12 rows
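
The object counting rows report RMSE (root mean squared error), where lower is better. As a reminder of what that metric measures, here is a small sketch that computes plain RMSE over predicted and ground-truth counts. The function name `rmse` and the sample counts are made up for illustration. The exact aggregation used by each benchmark (for example, averaging per category) is not stated on this page and may differ.

```python
import math

def rmse(predicted, actual):
    """Root mean squared error between two equal-length lists of counts."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical counts for four images (not taken from any benchmark).
predicted_counts = [2, 0, 3, 5]
true_counts = [2, 1, 3, 4]
print(rmse(predicted_counts, true_counts))
```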
