FiLM: Visual Reasoning with a General Conditioning Layer
About
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
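The feature-wise affine transformation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the conditioning information is a fixed-size vector (for example a question embedding) and that the per-channel scale `gamma` and shift `beta` come from plain linear projections. The names `film`, `w_gamma`, and `w_beta` are hypothetical.

```python
import numpy as np

def film(features, cond, w_gamma, w_beta):
    """Feature-wise affine modulation (illustrative sketch).

    features: (batch, channels, height, width) feature maps
    cond:     (batch, cond_dim) conditioning vectors
    w_gamma, w_beta: (cond_dim, channels) hypothetical linear projections
    """
    gamma = cond @ w_gamma  # per-channel scale, shape (batch, channels)
    beta = cond @ w_beta    # per-channel shift, shape (batch, channels)
    # Broadcast over the spatial dimensions: every location in a given
    # feature map is scaled and shifted by the same (gamma, beta) pair.
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4, 3, 3))   # toy feature maps
z = rng.normal(size=(2, 5))         # toy conditioning vectors
w_g = rng.normal(size=(5, 4))
w_b = rng.normal(size=(5, 4))

out = film(x, z, w_g, w_b)
print(out.shape)
```

The output has the same shape as the input feature maps. Only two numbers per channel depend on the conditioning input, which is what keeps the modulation cheap.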
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | FashionIQ (val) | Shirt Recall@10 | 15.04 | 455 |
| Composed Image Retrieval | Fashion-IQ (test) | Dress Recall@10 | 0.1423 | 145 |
| Multimodal Emotion Recognition | IEMOCAP (test) | Accuracy | 74.32 | 118 |
| Audio-Image-Text Classification | IEMOCAP (test) | Accuracy | 74.32 | 116 |
| Multimodal Multilabel Classification | MM-IMDB (test) | Macro F1 | 59.7 | 87 |
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 97.7 | 61 |
| Audio-Visual Classification | CREMA-D (test) | Accuracy | 60.07 | 60 |
| Multimodal Classification | KS (test) | Accuracy | 63.33 | 48 |
| Multimodal Classification | MVSA (test) | Accuracy (%) | 75.34 | 48 |
| Visual Question Answering | CLEVR 1.0 (test) | Overall Accuracy | 97.7 | 46 |