FiLM: Visual Reasoning with a General Conditioning Layer
About
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
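The feature-wise affine transformation at the heart of FiLM can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: it assumes `gamma` and `beta` have already been predicted per channel from the conditioning input (e.g. the question), and simply applies `gamma * x + beta` across each feature map.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation.

    features: (batch, channels, height, width) activations.
    gamma, beta: (batch, channels) scale and shift parameters,
        predicted from the conditioning input (e.g. the question).
    Each channel is scaled and shifted uniformly across spatial positions.
    """
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

# Toy example: one item, two channels, 2x2 spatial grid of ones.
x = np.ones((1, 2, 2, 2))
gamma = np.array([[2.0, 0.5]])   # per-channel scale
beta = np.array([[1.0, -1.0]])   # per-channel shift
out = film(x, gamma, beta)       # channel 0 -> 3.0, channel 1 -> -0.5
```

Because the modulation is feature-wise rather than element-wise, the conditioning network only needs to output two scalars per channel, which keeps the mechanism cheap and architecture-agnostic.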
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | FashionIQ (val) | Average Recall@10 | 15.52 | 489 |
| Composed Image Retrieval | Fashion-IQ (test) | Average Recall@10 | 0.1552 | 169 |
| Multimodal Emotion Recognition | IEMOCAP (test) | Accuracy | 74.32 | 162 |
| Audio-Image-Text Classification | IEMOCAP (test) | Accuracy | 74.32 | 116 |
| Multimodal Multilabel Classification | MM-IMDB (test) | Macro F1 | 59.7 | 87 |
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 97.7 | 61 |
| Audio-Visual Classification | CREMA-D (test) | Accuracy | 60.07 | 60 |
| Image Retrieval | Fashion200k (test) | Recall@1 | 12.9 | 58 |
| Multimodal Classification | KS (test) | Accuracy | 63.33 | 48 |
| Multimodal Classification | MVSA (test) | Accuracy (%) | 75.34 | 48 |