A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders
About
Understanding how generative models represent and transform data is a foundational problem in deep learning interpretability. While mechanistic interpretability of discriminative architectures has yielded substantial insights, relatively little work has addressed variational autoencoders (VAEs). This paper presents the first general-purpose multi-level causal intervention framework for mechanistic interpretability of VAEs. The framework comprises four intervention types: input manipulation, latent-space perturbation, activation patching, and causal mediation analysis. We also define three new quantitative metrics capturing properties not measured by existing disentanglement metrics alone: Causal Effect Strength (CES), intervention specificity, and circuit modularity. We conduct the largest empirical study to date of VAE causal mechanisms across six architectures (standard VAE, beta-VAE, FactorVAE, beta-TC-VAE, DIP-VAE-II, and VQ-VAE) and five benchmarks (dSprites, 3DShapes, MPI3D, CelebA, and SmallNORB), with three seeds per configuration, for a total of 90 independent training runs. Our results reveal several findings: (i) a consistent within-dataset negative correlation between CES and DCI disentanglement (the CES-DCI trade-off); (ii) the KL reweighting of beta-VAE induces a capacity bottleneck when the number of generative factors approaches the latent dimensionality, degrading disentanglement on complex datasets; (iii) no single VAE architecture dominates across all five datasets, so the optimal choice depends on dataset structure; and (iv) CES-based metrics applied to discrete latent spaces (VQ-VAE) yield near-zero values, revealing a critical limitation of continuous-intervention methods for discrete representations. These results provide both a theoretical foundation and a comprehensive empirical evaluation for mechanistic interpretability of generative models.
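The latent-space perturbation idea can be illustrated with a minimal sketch. The paper's exact CES formula is not reproduced here; the version below (mean L2 change in decoder output per unit shift of one latent coordinate) is an assumed stand-in for illustration, and the random linear `decode` function is a placeholder for a trained VAE decoder.

```python
import numpy as np

def causal_effect_strength(decode, z, dim, delta=1.0):
    """Toy proxy for Causal Effect Strength (CES): mean L2 change in the
    decoder output when latent dimension `dim` is shifted by `delta`.
    (Illustrative definition; the paper's exact metric may differ.)"""
    z_int = z.copy()
    z_int[:, dim] += delta  # do-style intervention on a single latent coordinate
    return float(np.linalg.norm(decode(z_int) - decode(z), axis=1).mean())

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64))        # stand-in linear "decoder" weights
decode = lambda z: z @ W            # placeholder for a trained VAE decoder
z = rng.normal(size=(128, 8))       # batch of sampled latent codes

# Per-dimension effect strengths; large values flag causally influential latents.
scores = [causal_effect_strength(decode, z, d) for d in range(8)]
```

With a linear decoder the score for dimension `d` reduces to the norm of the corresponding weight row, which makes the sketch easy to sanity-check; a real VAE decoder is nonlinear, so the effect would also depend on the base codes `z`.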
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Disentanglement | 3DShapes | -- | 22 |
| Disentanglement | Shapes3D (test) | -- | 19 |
| Disentangled Representation Learning | dSprites | -- | 15 |
| Disentanglement | MPI3D | -- | 14 |
| Circuit Modularity | dSprites | -- | 6 |
| Circuit Modularity | 3DShapes | -- | 6 |
| Circuit Modularity | CelebA | -- | 6 |
| Circuit Modularity | SmallNORB | -- | 6 |
| Disentangled Representation Learning | MPI3D | -- | 6 |
| Disentangled Representation Learning | CelebA | -- | 6 |