
Deep Multimodal Neural Architecture Search

About

Designing effective neural networks is fundamentally important in deep multimodal learning. Most existing works focus on a single task and design neural architectures manually; these architectures are highly task-specific and hard to generalize to different tasks. In this paper, we devise a generalized deep multimodal neural architecture search (MMnas) framework for various multimodal learning tasks. Given multimodal input, we first define a set of primitive operations and then construct a deep encoder-decoder based unified backbone, where each encoder or decoder block corresponds to an operation searched from a predefined operation pool. On top of the unified backbone, we attach task-specific heads to tackle different multimodal learning tasks. By using a gradient-based NAS algorithm, the optimal architectures for different tasks are learned efficiently. Extensive ablation studies, comprehensive analysis, and comparative experimental results show that the obtained MMnasNet significantly outperforms existing state-of-the-art approaches across three multimodal learning tasks (over five datasets), including visual question answering, image-text matching, and visual grounding.
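To make the "block searched from an operation pool" idea concrete, below is a minimal DARTS-style sketch in PyTorch of how a gradient-searchable block over a small operation pool could look. This is an illustration under assumptions, not the paper's actual implementation: the class names, the three-op pool, and the dimensions are hypothetical, and the real MMnas pool contains attention-based operations (e.g., self-attention, guided-attention, feed-forward, and relation ops) arranged in an encoder-decoder backbone.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelfAttention(nn.Module):
    """Multi-head self-attention wrapped to a uniform x -> x signature."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return out


class FeedForward(nn.Module):
    """Position-wise feed-forward block."""
    def __init__(self, dim, hidden=2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):
        return self.net(x)


class MixedOp(nn.Module):
    """One searchable block: a softmax-weighted sum over an operation pool.

    During search, the architecture parameters `alpha` are learned by
    gradient descent alongside the network weights; after search, each
    block keeps only its highest-scoring operation.
    """
    def __init__(self, dim):
        super().__init__()
        # Hypothetical 3-op pool for illustration only; the paper's pool
        # is larger and attention-based.
        self.ops = nn.ModuleList([SelfAttention(dim), FeedForward(dim), nn.Identity()])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))  # architecture params

    def forward(self, x):
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))


# A toy "unified backbone": a stack of searchable blocks.
dim, num_blocks = 512, 4
backbone = nn.Sequential(*[MixedOp(dim) for _ in range(num_blocks)])
x = torch.randn(2, 10, dim)   # (batch, tokens, dim) multimodal features
print(backbone(x).shape)      # torch.Size([2, 10, 512])
```

In gradient-based NAS of this style, the architecture parameters and the network weights are typically optimized in alternating (bi-level) steps on separate data splits; once the search converges, each block is discretized to its argmax operation and the resulting architecture is retrained from scratch.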

Zhou Yu, Yuhao Cui, Jun Yu, Meng Wang, Dacheng Tao, Qi Tian • 2020

Related benchmarks

Task                        Dataset              Result                   Rank
Visual Question Answering   VQA v2 (test-dev)    Overall Accuracy 71.24   664
Visual Question Answering   VQA v2 (test-std)    --                       466
Image Retrieval             Flickr30k (test)     R@1 60.7                 195
Visual Grounding            RefCOCO+ (val)       Accuracy 74.7            171
Visual Grounding            RefCOCO+ (testB)     Accuracy 65.2            169
Visual Grounding            RefCOCO+ (testA)     Accuracy 81.0            168
Visual Grounding            RefCOCO (testB)      Accuracy 78.4            125
Visual Grounding            RefCOCO (val)        Accuracy 84.2            119
Visual Grounding            RefCOCO (testA)      Accuracy 87.4            117
Visual Grounding            RefCOCOg (test)      Accuracy 75.7            96

(Showing 10 of 12 rows)
