Hellinger Multimodal Variational Autoencoders

About

Multimodal variational autoencoders (VAEs) are widely used for weakly supervised generative learning with multiple modalities. Predominant methods aggregate unimodal inference distributions using either a product of experts (PoE), a mixture of experts (MoE), or their combinations to approximate the joint posterior. In this work, we revisit multimodal inference through the lens of probabilistic opinion pooling, an optimization-based approach. We start from Hölder pooling with $\alpha=0.5$, which corresponds to the unique symmetric member of the $\alpha$-divergence family, and derive a moment-matching approximation, termed Hellinger. We then leverage this approximation to propose HELVAE, a multimodal VAE that avoids sub-sampling, yielding an efficient yet effective model that: (i) learns more expressive latent representations as additional modalities are observed; and (ii) empirically achieves better trade-offs between generative coherence and quality, outperforming state-of-the-art multimodal VAE models.

Huyen Vo, Isabel Valera • 2026
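
To make the pooling idea in the abstract concrete, here is a minimal sketch, not the authors' implementation. It assumes the standard power-mean (Hölder) pooling form $q(z) \propto \left(\sum_i w_i\, p_i(z)^{\alpha}\right)^{1/\alpha}$ with $\alpha = 0.5$, one-dimensional Gaussian experts, and equal weights by default. Under those assumptions the pooled density is a mixture of the Gaussians proportional to $\sqrt{p_i\, p_j}$, and moment matching collapses that mixture to a single Gaussian. The function name `hellinger_pool_1d` is hypothetical, and the paper's actual approximation may differ in its details.

```python
import math


def hellinger_pool_1d(means, variances, weights=None):
    """Moment-matched Gaussian for q(z) ∝ (sum_i w_i * sqrt(N(z; m_i, v_i)))**2.

    Assumption: 1-D Gaussian experts and power-mean pooling with alpha = 0.5.
    Returns the (mean, variance) of the single Gaussian that matches the
    first two moments of the pooled density.
    """
    n = len(means)
    if weights is None:
        weights = [1.0 / n] * n  # assumption: equal expert weights

    total = 0.0   # normalizing constant of the pooled density
    first = 0.0   # unnormalized first moment
    second = 0.0  # unnormalized second moment

    # Expanding the square gives one Gaussian-shaped term per pair (i, j).
    for i in range(n):
        for j in range(n):
            mi, vi = means[i], variances[i]
            mj, vj = means[j], variances[j]
            s = vi + vj
            # sqrt(p_i * p_j) is proportional to a Gaussian with these moments.
            m_ij = (mi * vj + mj * vi) / s
            v_ij = 2.0 * vi * vj / s
            # Its mass is the Bhattacharyya coefficient between p_i and p_j.
            bc = math.sqrt(2.0 * math.sqrt(vi * vj) / s) * math.exp(
                -((mi - mj) ** 2) / (4.0 * s)
            )
            c = weights[i] * weights[j] * bc
            total += c
            first += c * m_ij
            second += c * (v_ij + m_ij ** 2)

    mean = first / total
    var = second / total - mean ** 2
    return mean, var


if __name__ == "__main__":
    # Two disagreeing unimodal posteriors with equal confidence.
    print(hellinger_pool_1d([-1.0, 1.0], [1.0, 1.0]))
```

In this sketch, identical experts are returned unchanged, while disagreeing experts produce a pooled variance larger than either input variance. No sampling over modality subsets is involved, since the pooled moments come from a closed-form sum over expert pairs.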

Related benchmarks

Task	Dataset	Result
Multimodal Synthesis	PolyMNIST	Synthesis Coherence 91	26
Conditional Multi-component Generation	PolyMNIST	FID 116.8	18
Unconditional Multi-component Generation	PolyMNIST	FID 106	18
