Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance

About

Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training, no labels, and can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.

Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Jonas Fischer, Robin Hesse • 2026
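
The abstract does not spell out how ELUDe is constructed, so the following is only a toy illustration of the general idea it describes: a unit's computation can be split into additive sub-units whose sum reproduces the original value. The function name `split_preactivation`, the group names, and the choice to partition by input index are all assumptions made for this sketch, not the authors' method.

```python
import math

def split_preactivation(weights, inputs, bias, groups):
    """Split one neuron's pre-activation into additive sub-units.

    `groups` maps a hypothetical group name to the input indices assigned
    to it. Every input index must appear in exactly one group.
    """
    assigned = sorted(i for idx in groups.values() for i in idx)
    if assigned != list(range(len(weights))):
        raise ValueError("groups must partition the input indices")
    # Each sub-unit carries only the contributions of its own inputs.
    parts = {
        name: sum(weights[i] * inputs[i] for i in idx)
        for name, idx in groups.items()
    }
    # The bias is kept as its own sub-unit so nothing is lost.
    parts["bias"] = bias
    return parts

# Illustrative values only (not taken from the paper).
weights = [0.5, -1.0, 2.0, 0.25]
inputs = [1.0, 2.0, 0.5, 4.0]
bias = 0.1
groups = {"group_a": [0, 1], "group_b": [2, 3]}

original = sum(w * x for w, x in zip(weights, inputs)) + bias
parts = split_preactivation(weights, inputs, bias, groups)
recombined = sum(parts.values())

# The sub-units sum back to the original pre-activation (up to rounding).
assert math.isclose(original, recombined)
```

Because the sub-units only regroup terms of the same sum, the recombined value matches the original up to floating-point rounding. This is the sense in which such a decomposition leaves the downstream computation unchanged in this toy setting.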

Related benchmarks

Task	Dataset	Metric	Result	
Interpretability and Faithfulness Evaluation	DINOv2 ViT-B/14 tokens	LLM Rank	1	22
SAE Interpretability and Faithfulness Evaluation	DINOv2 ViT-B 14 Layer 12 activations	LLM Rank	4	12
Interpretability and Faithfulness Evaluation	ImageNet 21k ViT-B/16 tokens	LLM Rank	1	10
Computational Efficiency	DINOv2 ViT-B/14 (layer 11)	Latency (ms/batch)	437.7	10
Dictionary Learning Stability and Geometry Evaluation	DINOv2-B/14 activations (three seeded token datasets)	Cross-Init Similarity	93.34	9
SAE Interpretability and Faithfulness Evaluation	ConvNeXt Base activations V2	LLM Rank	4	6
SAE Interpretability and Faithfulness Evaluation	DINOv2 ViT-L/14 activations	LLM Rank	3	6
SAE Interpretability and Faithfulness Evaluation	SigLIP2-B 16 activations	LLM Rank	2	6
SAE Interpretability and Faithfulness Evaluation	DINOv2 ViT-S 14 activations	LLM Rank	1	6

