Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance
About
Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training and no labels, and it can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.
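To make "preserving the original computation by construction" concrete, here is a minimal NumPy sketch. It is not the ELUDe algorithm, whose details are not given in this abstract. It only illustrates the general idea of a lossless split: a latent vector is divided into per-direction parts plus a residual, and because the next layer is linear, routing each part through it separately and summing gives the same output as the original forward pass. The names `decompose`, `dictionary`, `h`, and `W`, along with the random data, are assumptions made for illustration.

```python
import numpy as np

def decompose(h, dictionary):
    """Split h into one part per dictionary row plus a residual.

    The parts and the residual sum back to h, so no information is lost.
    """
    # Least-squares coefficients of h on the dictionary rows (assumed directions).
    codes, *_ = np.linalg.lstsq(dictionary.T, h, rcond=None)
    parts = codes[:, None] * dictionary   # shape (k, d): one sub-unit per direction
    residual = h - parts.sum(axis=0)      # whatever the directions do not explain
    return parts, residual

rng = np.random.default_rng(0)
d, k, out_dim = 16, 4, 8
h = rng.normal(size=d)                 # hypothetical latent representation
dictionary = rng.normal(size=(k, d))   # hypothetical concept directions
W = rng.normal(size=(out_dim, d))      # hypothetical weights of the next linear layer

parts, residual = decompose(h, dictionary)

original = W @ h
# Route each sub-unit through the layer separately, then add the residual path.
rerouted = sum(W @ p for p in parts) + W @ residual

# Equal up to floating-point rounding, by linearity of the layer.
print(np.allclose(original, rerouted))
```

In this toy setting the equivalence holds for any choice of directions, because the residual absorbs whatever the directions miss. How meaningful directions are found, and how nonlinear layers are handled, are the parts this sketch does not cover.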
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Interpretability and Faithfulness Evaluation | DINOv2 ViT-B/14 tokens | LLM Rank | 1 | 22 |
| SAE Interpretability and Faithfulness Evaluation | DINOv2 ViT-B 14 Layer 12 activations | LLM Rank | 4 | 12 |
| Interpretability and Faithfulness Evaluation | ImageNet 21k ViT-B/16 tokens | LLM Rank | 1 | 10 |
| Computational Efficiency | DINOv2 ViT-B/14 (layer 11) | Latency (ms/batch) | 437.7 | 10 |
| Dictionary Learning Stability and Geometry Evaluation | DINOv2-B/14 activations (three seeded token datasets) | Cross-Init Similarity | 93.34 | 9 |
| SAE Interpretability and Faithfulness Evaluation | ConvNeXt Base activations V2 | LLM Rank | 4 | 6 |
| SAE Interpretability and Faithfulness Evaluation | DINOv2 ViT-L/14 activations | LLM Rank | 3 | 6 |
| SAE Interpretability and Faithfulness Evaluation | SigLIP2-B 16 activations | LLM Rank | 2 | 6 |
| SAE Interpretability and Faithfulness Evaluation | DINOv2 ViT-S 14 activations | LLM Rank | 1 | 6 |