FiLM: Visual Reasoning with a General Conditioning Layer

About

We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
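The feature-wise affine transformation described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the conditioning information has already been encoded as a fixed-length vector, and it uses a single hand-set linear map as a stand-in for whatever learned network predicts the per-channel scale (`gamma`) and shift (`beta`). The helper names `linear` and `film`, the toy shapes, and all weight values are hypothetical.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def linear(weights: Matrix, bias: Sequence[float], z: Sequence[float]) -> List[float]:
    """Return weights @ z + bias using plain Python lists."""
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]


def film(feature_maps: List[Matrix], gamma: Sequence[float], beta: Sequence[float]) -> List[Matrix]:
    """Apply gamma[c] * x + beta[c] to every activation x in channel c."""
    return [
        [[g * x + b for x in row] for row in channel]
        for channel, g, b in zip(feature_maps, gamma, beta)
    ]


# Toy input (assumed shapes): 2 channels of 2x2 activations.
features = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.5, -1.0], [0.0, 2.0]],
]

# Assumed 3-dimensional encoding of the conditioning information.
condition = [1.0, 0.0, -1.0]

# Hand-set parameters standing in for a learned predictor of gamma and beta.
w_gamma = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
b_gamma = [1.0, 0.0]
w_beta = [[0.0, 1.0, 0.0], [0.5, 0.0, 0.0]]
b_beta = [0.0, 0.0]

gamma = linear(w_gamma, b_gamma, condition)  # one scale per channel
beta = linear(w_beta, b_beta, condition)     # one shift per channel
modulated = film(features, gamma, beta)
```

Because `gamma` and `beta` hold one value per channel, the conditioning input can scale up, suppress, or negate whole feature maps without touching the weights of the layers that produced them.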

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, Aaron Courville • 2017

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | FashionIQ (val) | Shirt Recall@10 | 15.04 | 455 |
| Composed Image Retrieval | Fashion-IQ (test) | Dress Recall@10 | 0.1423 | 145 |
| Multimodal Emotion Recognition | IEMOCAP (test) | Accuracy | 74.32 | 118 |
| Audio-Image-Text Classification | IEMOCAP (test) | Accuracy | 74.32 | 116 |
| Multimodal Multilabel Classification | MM-IMDB (test) | Macro F1 | 59.7 | 87 |
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 97.7 | 61 |
| Audio-Visual Classification | CREMA-D (test) | Accuracy | 60.07 | 60 |
| Multimodal Classification | KS (test) | Accuracy | 63.33 | 48 |
| Multimodal Classification | MVSA (test) | Accuracy (%) | 75.34 | 48 |
| Visual Question Answering | CLEVR 1.0 (test) | Overall Accuracy | 97.7 | 46 |
Showing 10 of 42 rows
