
A Multi-Level Causal Intervention Framework for Mechanistic Interpretability in Variational Autoencoders

About

Understanding how generative models represent and transform data is a foundational problem in deep learning interpretability. While mechanistic interpretability of discriminative architectures has yielded substantial insights, relatively little work has addressed variational autoencoders (VAEs). This paper presents the first general-purpose multi-level causal intervention framework for mechanistic interpretability of VAEs. The framework comprises four manipulation types: input manipulation, latent-space perturbation, activation patching, and causal mediation analysis. We also define three new quantitative metrics capturing properties not measured by existing disentanglement metrics alone: Causal Effect Strength (CES), intervention specificity, and circuit modularity. We conduct the largest empirical study to date of VAE causal mechanisms across six architectures (standard VAE, beta-VAE, FactorVAE, beta-TC-VAE, DIP-VAE-II, and VQ-VAE) and five benchmarks (dSprites, 3DShapes, MPI3D, CelebA, and SmallNORB), with three seeds per configuration, totaling 90 independent training runs. Our results reveal several findings: (i) a consistent within-dataset negative correlation between CES and DCI disentanglement (the CES-DCI trade-off); (ii) that the KL reweighting mechanism of beta-VAE induces a capacity bottleneck when the number of generative factors approaches the latent dimensionality, degrading disentanglement on complex datasets; (iii) that no single VAE architecture dominates across all five datasets, with the optimal choice depending on dataset structure; and (iv) that CES-based metrics applied to discrete latent spaces (VQ-VAE) yield near-zero values, revealing a critical limitation of continuous-intervention methods for discrete representations. These results provide both a theoretical foundation and a comprehensive empirical evaluation of mechanistic interpretability for generative models.

Dip Roy, Rajiv Misra, Sanjay Kumar Singh, Anisha Roy • 2025
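This page does not reproduce the paper's formal definitions, but the abstract's core operations are concrete enough to sketch. Below is a minimal, hypothetical PyTorch illustration of a latent-space perturbation (a do-style intervention on one latent dimension) and a CES-like score. The function names and the specific effect measure (mean per-sample L2 change in decoder output per unit of latent shift) are assumptions made for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a latent-space intervention on a trained VAE
# and a CES-style effect estimate. NOT the paper's implementation:
# the L2-based effect measure and all names here are assumptions.
import torch

@torch.no_grad()
def latent_intervention(decoder, z, dim, value):
    """do(z[:, dim] = value): overwrite one latent dimension, then decode."""
    z_int = z.clone()
    z_int[:, dim] = value
    return decoder(z_int)

@torch.no_grad()
def ces_estimate(decoder, z, dim, deltas=(-2.0, -1.0, 1.0, 2.0)):
    """Rough CES-style score for one latent dimension: mean per-sample
    L2 change in decoder output, normalized by the latent shift size."""
    base = decoder(z)
    effects = []
    for d in deltas:
        out = latent_intervention(decoder, z, dim, z[:, dim] + d)
        change = (out - base).flatten(1).norm(dim=1).mean() / abs(d)
        effects.append(change)
    return torch.stack(effects).mean()
```

In use, z would be sampled from the encoder's posterior q(z|x) for a batch of inputs and ces_estimate run once per latent dimension. Note that an additive shift of this kind is not well defined for a discrete latent space such as VQ-VAE's codebook indices, which is consistent with the abstract's finding (iv) that continuous-intervention CES yields near-zero values there.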

Related benchmarks

Task                                   Dataset           Result   Rank
Disentanglement                        3DShapes          -        22
Disentanglement                        Shapes3D (test)   -        19
Disentangled Representation Learning   dSprites          -        15
Disentanglement                        MPI3D             -        14
Circuit Modularity                     dSprites          -        6
Circuit Modularity                     3DShapes          -        6
Circuit Modularity                     CelebA            -        6
Circuit Modularity                     SmallNORB         -        6
Disentangled Representation Learning   MPI3D             -        6
Disentangled Representation Learning   CelebA            -        6

Showing 10 of 15 rows.
