
Hellinger Multimodal Variational Autoencoders

About

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods approximate the joint posterior by aggregating unimodal inference distributions with a product of experts (PoE), a mixture of experts (MoE), or a combination of the two. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha = 0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then use this approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.
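To make the pooling idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes the common form of Hölder pooling, $q(z) \propto \big(\sum_i w_i\, p_i(z)^{\alpha}\big)^{1/\alpha}$, with $\alpha = 0.5$, one-dimensional Gaussian experts, and uniform weights by default. Under these assumptions the pooled density expands into a mixture of pairwise terms $\sqrt{p_i p_j}$, each an unnormalized Gaussian whose mass is the Bhattacharyya coefficient, so its first two moments are available in closed form. The function name `hellinger_pool_moments` is hypothetical, and the paper's actual derivation may differ.

```python
import numpy as np

def hellinger_pool_moments(mu, var, w=None):
    """Mean and variance of the alpha=0.5 Hölder pool of 1-D Gaussian experts.

    Assumed pooling form: q(z) ∝ (sum_i w_i * sqrt(p_i(z)))**2.
    """
    mu = np.asarray(mu, dtype=float)
    var = np.asarray(var, dtype=float)
    n = mu.size
    w = np.full(n, 1.0 / n) if w is None else np.asarray(w, dtype=float)

    # Broadcast to all (i, j) pairs of experts.
    mi, mj = mu[:, None], mu[None, :]
    vi, vj = var[:, None], var[None, :]

    # sqrt(p_i * p_j) is an unnormalized Gaussian with these parameters.
    v_ij = 2.0 * vi * vj / (vi + vj)
    m_ij = (mi * vj + mj * vi) / (vi + vj)

    # Its total mass is the Bhattacharyya coefficient between p_i and p_j.
    bc = np.sqrt(2.0 * np.sqrt(vi * vj) / (vi + vj)) * np.exp(
        -((mi - mj) ** 2) / (4.0 * (vi + vj))
    )

    # Normalized mixture weights over pairs.
    pi = w[:, None] * w[None, :] * bc
    pi = pi / pi.sum()

    # Moment matching: first and second moments of the mixture.
    mean = float(np.sum(pi * m_ij))
    second = float(np.sum(pi * (v_ij + m_ij ** 2)))
    return mean, second - mean ** 2


# Two unit-variance experts centered at -1 and +1.
mean, variance = hellinger_pool_moments([-1.0, 1.0], [1.0, 1.0])
print(mean, variance)
```

For these two experts the matched variance falls between the PoE value (0.5) and the moment-matched MoE value (2.0), which illustrates how the $\alpha = 0.5$ pool interpolates between the two standard aggregation schemes.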

Huyen Vo, Isabel Valera • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Synthesis | PolyMNIST | Synthesis Coherence | 91 | 26 |
| Conditional Multi-component Generation | PolyMNIST | FID | 116.8 | 18 |
| Unconditional Multi-component Generation | PolyMNIST | FID | 106 | 18 |
