Multimodal Structure Learning: Disentangling Shared and Specific Topology via Cross-Modal Graphical Lasso

About

Learning interpretable multimodal representations inherently relies on uncovering the conditional dependencies between heterogeneous features. However, applying sparse graph estimation techniques, such as Graphical Lasso (GLasso), to visual-linguistic domains is severely bottlenecked by high-dimensional noise, modality misalignment, and the confounding of shared versus category-specific topologies. In this paper, we propose Cross-Modal Graphical Lasso (CM-GLasso), which overcomes these fundamental limitations. By coupling a novel text-visualization strategy with a unified vision-language encoder, we strictly align multimodal features into a shared latent space. We introduce a cross-attention distillation mechanism that condenses high-dimensional patches into explicit semantic nodes, naturally extracting spatial-aware cross-modal priors. Furthermore, we unify tailored GLasso estimation and Common-Specific Structure Learning (CSSL) into a joint objective optimized via the Alternating Direction Method of Multipliers (ADMM). This formulation guarantees the simultaneous disentanglement of invariant and class-specific precision matrices without multi-step error accumulation. Extensive experiments across eight benchmarks covering both natural and medical domains demonstrate that CM-GLasso establishes a new state of the art in generative classification and dense semantic segmentation tasks.
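The abstract does not spell out the CM-GLasso joint objective, so the following is only a point of reference: a minimal sketch of the standard single-graph Graphical Lasso solved with ADMM, which is the building block the abstract names. It is not the authors' method and omits the common-specific decomposition entirely. The function names `soft_threshold` and `glasso_admm`, the parameters `lam`, `rho`, `n_iter`, and `tol`, and the choice to penalize only off-diagonal entries are all assumptions made for illustration.

```python
import numpy as np


def soft_threshold(a, kappa):
    """Elementwise soft-thresholding: the proximal operator of kappa * |.|_1."""
    return np.sign(a) * np.maximum(np.abs(a) - kappa, 0.0)


def glasso_admm(S, lam=0.1, rho=1.0, n_iter=200, tol=1e-6):
    """Estimate a sparse precision matrix from an empirical covariance S.

    Standard GLasso splitting (assumed here, not taken from the paper):
        minimize  tr(S @ Theta) - log det(Theta) + lam * ||Z||_1 (off-diagonal)
        subject to Theta = Z
    """
    p = S.shape[0]
    Z = np.eye(p)          # sparse copy of the precision matrix
    U = np.zeros((p, p))   # scaled dual variable
    for _ in range(n_iter):
        # Theta-update: closed form via an eigendecomposition.
        eigval, Q = np.linalg.eigh(rho * (Z - U) - S)
        theta_eig = (eigval + np.sqrt(eigval ** 2 + 4.0 * rho)) / (2.0 * rho)
        Theta = (Q * theta_eig) @ Q.T  # positive definite by construction

        # Z-update: soft-threshold off-diagonals, leave the diagonal unpenalized.
        A = Theta + U
        Z_old = Z
        Z = soft_threshold(A, lam / rho)
        np.fill_diagonal(Z, np.diag(A))

        # Dual update (equivalent to U + Theta - Z).
        U = A - Z

        # Stop when both the primal and dual residuals are small.
        if np.linalg.norm(Theta - Z) < tol and np.linalg.norm(Z - Z_old) < tol:
            break
    return Z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 5))      # synthetic features, illustration only
    S = np.cov(X, rowvar=False)
    print(np.round(glasso_admm(S, lam=0.2), 3))
```

Zeros in the returned matrix correspond to pairs of features that are estimated to be conditionally independent given the rest. A common-specific variant would, under the abstract's description, estimate one shared and several class-specific precision matrices jointly; how the paper couples them is not stated here, so it is left out of the sketch.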

Fei Wang, Yutong Zhang, Xiong Wang • 2026

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K	mIoU 64.01	699
Semantic segmentation	COCO 2014 (val)	mIoU 46.82	304
Semantic segmentation	VOC 2012	mIoU 74.75	71
Classification	CUB-200 2011	Accuracy 92.83	14
Semantic segmentation	Kvasir-Seg	mIoU 89.03	13
Classification	CIFAR-10	Accuracy 94.71	5
Classification	CIFAR-100	Accuracy 94.26	5
Classification	Caltech-256	Accuracy 86.07	4
