Domain Restriction via Multi SAE Layer Transitions

About

The general-purpose nature of Large Language Models (LLMs) presents a significant challenge for domain-specific applications, often leading to out-of-domain (OOD) interactions that undermine the provider's intent. Existing methods for detecting such scenarios treat the LLM as an uninterpretable black box and overlook the internal processing of inputs. In this work, we show that layer transitions provide a promising avenue for extracting domain-specific signatures. Specifically, we present several lightweight ways of learning on internal dynamics, encoded using a sparse autoencoder (SAE), that exhibit strong capability in distinguishing OOD texts. Building on transitions between SAE representations lets us better interpret how the LLM's internal processing of an input evolves and sheds light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B and 9B models. Our results emphasize the efficacy of the internal process in capturing fine-grained input-related details.

Elias Shaheen, Avi Mendelson • 2026
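
The abstract describes learning on layer-to-layer transitions of SAE-encoded activations to flag OOD text. The sketch below is one illustrative reading of that idea, not the authors' implementation. It makes three assumptions: one toy SAE encoder per layer with hand-picked weights, transition features defined as differences between consecutive layers' SAE codes, and a detector that scores distance to the in-domain centroid. All names (`sae_encode`, `encode_text`, `transition_features`, `ood_score`) and all numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = List[float]
SAE = Tuple[Sequence[Sequence[float]], Sequence[float]]  # (encoder weights, bias)


def sae_encode(hidden: Sequence[float], sae: SAE) -> Vector:
    """Toy SAE encoder: ReLU(W h + b). Real SAEs are trained; these weights are placeholders."""
    weights, bias = sae
    return [max(0.0, sum(w * h for w, h in zip(row, hidden)) + b)
            for row, b in zip(weights, bias)]


def transition_features(layer_codes: Sequence[Vector]) -> Vector:
    """Concatenate the change in SAE code between each pair of consecutive layers."""
    feats: Vector = []
    for prev, nxt in zip(layer_codes, layer_codes[1:]):
        feats.extend(n - p for p, n in zip(prev, nxt))
    return feats


def encode_text(hidden_by_layer: Sequence[Sequence[float]], saes: Sequence[SAE]) -> Vector:
    """Encode each layer's hidden state with that layer's SAE, then take transitions."""
    codes = [sae_encode(h, sae) for h, sae in zip(hidden_by_layer, saes)]
    return transition_features(codes)


def centroid(rows: Sequence[Vector]) -> Vector:
    """Mean feature vector of the in-domain examples."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def ood_score(feats: Vector, center: Vector) -> float:
    """Euclidean distance to the in-domain centroid; larger means more OOD-like."""
    return math.sqrt(sum((f - c) ** 2 for f, c in zip(feats, center)))


# Hypothetical setup: 3 layers, hidden size 2, 3 SAE features per layer.
toy_sae: SAE = ([[1, 0], [0, 1], [1, 1]], [0, 0, -1])
saes = [toy_sae, toy_sae, toy_sae]

# Made-up per-layer hidden states (one vector per layer) for three texts.
in_domain = [
    [[1, 0], [1, 1], [0, 2]],
    [[1, 0.2], [0.8, 1], [0, 1.8]],
]
out_of_domain = [[0, 1], [1, 0], [2, 0]]

in_feats = [encode_text(h, saes) for h in in_domain]
center = centroid(in_feats)

id_score = ood_score(in_feats[0], center)
od_score = ood_score(encode_text(out_of_domain, saes), center)
print(f"in-domain score: {id_score:.3f}, OOD score: {od_score:.3f}")
```

In practice the hidden states would come from the LLM's residual stream and the SAE weights from pretrained per-layer SAEs. The centroid detector stands in for whichever lightweight learner is used. Thresholding the score, or feeding it to an AUROC computation, would then give results comparable in kind to the benchmarks listed below.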

Related benchmarks

Task	Dataset	Result
Near-OOD Detection	AGNews one-vs-all runs	AUROC (c3) 0.9	4
Near-OOD Detection	ROSTD	AUROC 95.32	4
Near-OOD Detection	SNIPS	AUROC 96.99	4
Near-OOD Detection	CLINC150	AUROC 83.64	4
Out-of-Distribution Detection	20NG (ID) vs SST-2 (OOD)	AUROC 0.9866	4
Out-of-Distribution Detection	20NG (ID) vs MNLI (OOD)	AUROC 0.9763	4
Out-of-Distribution Detection	20NG (ID) vs RTE (OOD)	AUROC 0.9884	4
Out-of-Distribution Detection	20NG (ID) vs IMDB (OOD)	AUROC 95.4	4
