FiLM: Visual Reasoning with a General Conditioning Layer
About
We introduce a general-purpose conditioning method for neural networks called FiLM: Feature-wise Linear Modulation. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. We show that FiLM layers are highly effective for visual reasoning - answering image-related questions which require a multi-step, high-level process - a task which has proven difficult for standard deep learning methods that do not explicitly model reasoning. Specifically, we show on visual reasoning tasks that FiLM layers 1) halve state-of-the-art error for the CLEVR benchmark, 2) modulate features in a coherent manner, 3) are robust to ablations and architectural modifications, and 4) generalize well to challenging, new data from few examples or even zero-shot.
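The feature-wise affine transformation at the heart of FiLM can be sketched in a few lines. The snippet below is a minimal NumPy illustration, not the paper's implementation: it assumes `gamma` and `beta` have already been predicted per channel from the conditioning input (e.g. the question), and simply applies `gamma * x + beta` across each feature map.

```python
import numpy as np

def film(features, gamma, beta):
    """Feature-wise linear modulation.

    features: (batch, channels, height, width) activations.
    gamma, beta: (batch, channels) scale and shift parameters,
        predicted from the conditioning input (e.g. the question).
    Each channel is scaled and shifted uniformly across spatial positions.
    """
    return gamma[:, :, None, None] * features + beta[:, :, None, None]

# Toy example: one item, two channels, 2x2 spatial grid of ones.
x = np.ones((1, 2, 2, 2))
gamma = np.array([[2.0, 0.5]])   # per-channel scale
beta = np.array([[1.0, -1.0]])   # per-channel shift
out = film(x, gamma, beta)       # channel 0 -> 3.0, channel 1 -> -0.5
```

Because the modulation is feature-wise rather than element-wise, the conditioning network only needs to output two scalars per channel, which keeps the mechanism cheap and architecture-agnostic.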
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Composed Image Retrieval | FashionIQ (val) | Average Recall@10 | 15.52 | 489 |
| Composed Image Retrieval | Fashion-IQ (test) | Average Recall@10 | 0.1552 | 169 |
| Multimodal Emotion Recognition | IEMOCAP (test) | Accuracy | 74.32 | 162 |
| Audio-Image-Text Classification | IEMOCAP (test) | Accuracy | 74.32 | 116 |
| Multimodal Multilabel Classification | MM-IMDB (test) | Macro F1 | 59.7 | 87 |
| Visual Question Answering | CLEVR (test) | Overall Accuracy | 97.7 | 61 |
| Audio-Visual Classification | CREMA-D (test) | Accuracy | 60.07 | 60 |
| Image Retrieval | Fashion200k (test) | Recall@1 | 12.9 | 58 |
| Multimodal Classification | KS (test) | Accuracy | 63.33 | 48 |
| Multimodal Classification | MVSA (test) | Accuracy (%) | 75.34 | 48 |