Multimodal Generative Models for Scalable Weakly-Supervised Learning
About
Multiple modalities often co-occur when describing natural phenomena. Learning a joint representation of these modalities should yield deeper and more useful representations. Previous generative approaches to multimodal input either do not learn a joint distribution or require additional computation to handle missing data. Here, we introduce a multimodal variational autoencoder (MVAE) that uses a product-of-experts inference network and a sub-sampled training paradigm to solve the multimodal inference problem. Notably, our model shares parameters to efficiently learn under any combination of missing modalities. We apply the MVAE to four datasets and match state-of-the-art performance using many fewer parameters. In addition, we show that the MVAE is directly applicable to weakly-supervised learning, and is robust to incomplete supervision. We then consider two case studies: one of learning image transformations (edge detection, colorization, segmentation) as a set of modalities, followed by one of machine translation between two languages. We find appealing results across this range of tasks.
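To make the product-of-experts idea concrete, here is a minimal sketch of how per-modality posteriors can be fused while skipping missing modalities. It assumes each expert is a diagonal Gaussian and that a standard normal prior acts as an extra expert; the abstract does not state these details, and the names `product_of_experts` and `infer` are hypothetical. In a real model the means and variances would come from per-modality encoder networks, which are replaced here by fixed numbers.

```python
from typing import Dict, List, Optional, Tuple

# (mean, variance) per latent dimension; an assumed representation.
Gaussian = Tuple[List[float], List[float]]


def product_of_experts(experts: List[Gaussian], latent_dim: int) -> Gaussian:
    """Fuse diagonal Gaussian experts into a single diagonal Gaussian.

    Assumption: a standard normal prior expert (mean 0, variance 1) is
    always included, so the result is defined even with no observations.
    """
    # Start from the prior expert: precision 1, precision-weighted mean 0.
    precision = [1.0] * latent_dim
    weighted_mean = [0.0] * latent_dim
    for mean, var in experts:
        for d in range(latent_dim):
            precision[d] += 1.0 / var[d]          # precisions add up
            weighted_mean[d] += mean[d] / var[d]  # means weighted by precision
    joint_var = [1.0 / p for p in precision]
    joint_mean = [m * v for m, v in zip(weighted_mean, joint_var)]
    return joint_mean, joint_var


def infer(observed: Dict[str, Optional[Gaussian]], latent_dim: int) -> Gaussian:
    """Drop missing modalities (marked None) and fuse the remaining experts."""
    present = [g for g in observed.values() if g is not None]
    return product_of_experts(present, latent_dim)


# Hypothetical one-dimensional experts for two modalities.
image_expert: Gaussian = ([2.0], [1.0])
text_expert: Gaussian = ([4.0], [0.5])

both = infer({"image": image_expert, "text": text_expert}, latent_dim=1)
image_only = infer({"image": image_expert, "text": None}, latent_dim=1)
nothing = infer({"image": None, "text": None}, latent_dim=1)
```

Because a missing modality simply contributes no term to the sums, the same function covers every subset of observed modalities without a separate inference network per subset. With no observations at all, the sketch falls back to the assumed prior.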
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mortality Prediction | eICU | AUC-PRC | 0.52 | 53 |
| Classification | YaleB (test) | Accuracy | 100 | 48 |
| Medication Recommendation | eICU | PR AUC | 25.2 | 43 |
| Multimodal Synthesis | PolyMNIST | Synthesis Coherence | 30.1 | 26 |
| Image Classification | PMNIST (test) | Accuracy | 96.8 | 25 |
| Unconditional Multi-component Generation | PolyMNIST | FID | 50.65 | 18 |
| Conditional Multi-component Generation | PolyMNIST | FID | 82.59 | 18 |
| Disease Diagnosis | eICU | AUPRC | 25.6 | 15 |
| Behavior Decoding | NHP center-out reaching (test) | CC Accuracy | 0.544 | 15 |
| Behavior Decoding | NHP grid reaching (test) | Accuracy (CC) | 42.5 | 15 |