Multimodal Generative Models for Scalable Weakly-Supervised Learning
About
Multiple modalities often co-occur when describing natural phenomena, and learning a joint representation of them should yield deeper and more useful representations. Previous generative approaches to multimodal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multimodal inference problem. Notably, our model shares parameters to learn efficiently under any combination of missing modalities. We apply the MVAE to four datasets and match state-of-the-art performance with far fewer parameters. We also show that the MVAE is directly applicable to weakly-supervised learning and is robust to incomplete supervision. Finally, we consider two case studies: learning image transformations (edge detection, colorization, segmentation) as a set of modalities, and machine translation between two languages. We find appealing results across this range of tasks.
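The product-of-experts posterior is simple to implement: for diagonal-Gaussian experts, precisions add and means combine precision-weighted, with a spherical-Gaussian prior always included as one expert. Below is a minimal PyTorch sketch under those assumptions; the function names (`poe`, `prior_expert`) and tensor shapes are illustrative, not taken from the authors' released code.

```python
import torch

def prior_expert(shape):
    """Spherical-Gaussian prior N(0, I), expressed as (mu, logvar)."""
    return torch.zeros(shape), torch.zeros(shape)

def poe(mus, logvars, eps=1e-8):
    """Product of diagonal-Gaussian experts.

    mus, logvars: lists of (batch, latent_dim) tensors, one entry per
    observed modality plus the prior expert. Returns the joint mu, logvar.
    """
    mu = torch.stack(mus)                       # (n_experts, batch, dim)
    logvar = torch.stack(logvars)
    precision = 1.0 / (logvar.exp() + eps)      # T_i = 1 / sigma_i^2
    joint_precision = precision.sum(dim=0)      # precisions add
    joint_mu = (mu * precision).sum(dim=0) / joint_precision  # weighted mean
    joint_logvar = (1.0 / joint_precision).log()
    return joint_mu, joint_logvar

# Usage: combine the prior with whichever modality encoders are observed.
batch, latent_dim = 32, 64
mu_p, lv_p = prior_expert((batch, latent_dim))
mu_img, lv_img = torch.randn(batch, latent_dim), torch.randn(batch, latent_dim)
mu_z, logvar_z = poe([mu_p, mu_img], [lv_p, lv_img])  # other experts omitted
```

Missing modalities are handled by simply omitting their experts from the product, which is why a single set of shared parameters suffices for any combination of observed inputs.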
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Classification | YaleB (test) | Accuracy | 100 | 48 |
| Behavior Decoding | NHP center-out reaching (test) | Accuracy (CC) | 0.544 | 15 |
| Behavior Decoding | NHP grid reaching (test) | Accuracy (CC) | 42.5 | 15 |
| PAWP Prediction | ASPIRE registry | AUROC | 0.758 | 10 |
| Joint Clustering | CUB Image-Captions for Clustering (CUBICC) (test) | Accuracy | 38.7 | 10 |
| Caption-only Clustering | CUB Image-Captions for Clustering (CUBICC) (test) | Accuracy | 18.1 | 10 |
| Image-only Clustering | CUB Image-Captions for Clustering (CUBICC) (test) | Accuracy | 26.2 | 10 |
| Multi-modal Image Synthesis (iUS + T2 inputs) | Brain Glioma Patients | T2 PSNR (dB) | 21.7 | 8 |
| Image Synthesis (T2 to iUS) | Brain Glioma Patients | iUS PSNR (dB) | 21.21 | 6 |