A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders
About
Understanding how generative models represent and transform data is a foundational problem in deep learning interpretability. While mechanistic interpretability of discriminative architectures has yielded substantial insights, relatively little work has addressed variational autoencoders (VAEs). This paper presents the first general-purpose multi-level causal intervention framework for mechanistic interpretability of VAEs. The framework comprises four intervention types: input manipulation, latent-space perturbation, activation patching, and causal mediation analysis. We also define three new quantitative metrics capturing properties not measured by existing disentanglement metrics alone: Causal Effect Strength (CES), intervention specificity, and circuit modularity. We conduct the largest empirical study to date of VAE causal mechanisms across six architectures (standard VAE, beta-VAE, FactorVAE, beta-TC-VAE, DIP-VAE-II, and VQ-VAE) and five benchmarks (dSprites, 3DShapes, MPI3D, CelebA, and SmallNORB), with three seeds per configuration, for a total of 90 independent training runs. Our results reveal several findings: (i) a consistent within-dataset negative correlation between CES and DCI disentanglement (the CES-DCI trade-off); (ii) the KL reweighting of beta-VAE induces a capacity bottleneck when the number of generative factors approaches the latent dimensionality, degrading disentanglement on complex datasets; (iii) no single VAE architecture dominates across all five datasets, so the optimal choice depends on dataset structure; and (iv) CES-based metrics applied to discrete latent spaces (VQ-VAE) yield near-zero values, revealing a critical limitation of continuous-intervention methods for discrete representations. These results provide both a theoretical foundation and a comprehensive empirical evaluation for mechanistic interpretability of generative models.
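The latent-space perturbation idea can be illustrated with a minimal sketch. The paper's exact CES formula is not reproduced here; the version below (mean L2 change in decoder output per unit shift of one latent coordinate) is an assumed stand-in for illustration, and the random linear `decode` function is a placeholder for a trained VAE decoder.

```python
import numpy as np

def causal_effect_strength(decode, z, dim, delta=1.0):
    """Toy proxy for Causal Effect Strength (CES): mean L2 change in the
    decoder output when latent dimension `dim` is shifted by `delta`.
    (Illustrative definition; the paper's exact metric may differ.)"""
    z_int = z.copy()
    z_int[:, dim] += delta  # do-style intervention on a single latent coordinate
    return float(np.linalg.norm(decode(z_int) - decode(z), axis=1).mean())

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))        # stand-in linear "decoder" weights
decode = lambda z: z @ W            # placeholder for a trained VAE decoder
z = rng.normal(size=(128, 8))       # batch of sampled latent codes

# Per-dimension effect strengths; large values flag causally influential latents.
scores = [causal_effect_strength(decode, z, d) for d in range(8)]
```

With a linear decoder the score for dimension `d` reduces to the norm of the corresponding weight row, which makes the sketch easy to sanity-check; a real VAE decoder is nonlinear, so the effect would also depend on the base codes `z`.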
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Disentanglement | 3DShapes | -- | 22 |
| Disentanglement | Shapes3D (test) | -- | 19 |
| Disentangled Representation Learning | dSprites | -- | 15 |
| Disentanglement | MPI3D | -- | 14 |
| Circuit Modularity | dSprites | -- | 6 |
| Circuit Modularity | 3DShapes | -- | 6 |
| Circuit Modularity | CelebA | -- | 6 |
| Circuit Modularity | SmallNORB | -- | 6 |
| Disentangled Representation Learning | MPI3D | -- | 6 |
| Disentangled Representation Learning | CelebA | -- | 6 |