Domain Restriction via Multi SAE Layer Transitions

About

The general-purpose nature of Large Language Models (LLMs) poses a significant challenge for domain-specific applications, because it often leads to out-of-domain (OOD) interactions that undermine the provider's intent. Existing methods for detecting such interactions treat the LLM as an uninterpretable black box and overlook how it processes inputs internally. In this work we show that layer transitions provide a promising avenue for extracting domain-specific signatures. Specifically, we present several lightweight ways of learning on internal dynamics encoded with a sparse autoencoder (SAE), which prove highly capable of distinguishing OOD texts. Building on transitions between SAE representations lets us better interpret how the LLM's processing of an input evolves internally and sheds light on its decisions. We provide a comprehensive analysis of the method and benchmark it with the gemma-2 2B and 9B models. Our results emphasize the efficacy of the internal process in capturing fine-grained input-related details.
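To make the idea of learning on SAE-encoded layer transitions more concrete, the following is a minimal sketch and not the authors' implementation. Everything in it is an assumption made for illustration: the SAE encoders use random weights rather than trained ones, the "activations" are synthetic Gaussians rather than gemma-2 hidden states, the transition is taken to be the difference between the SAE codes of two layers, and the detector is a simple diagonal Gaussian fit. The names `sae_encode`, `transition_features`, `DiagonalGaussianDetector`, and `fake_activations` are hypothetical.

```python
import numpy as np


def sae_encode(h, w_enc, b_enc):
    """Toy SAE encoder (assumption): ReLU(h @ W + b) yields non-negative codes."""
    return np.maximum(h @ w_enc + b_enc, 0.0)


def transition_features(h_lo, h_hi, sae_lo, sae_hi):
    """Describe a layer transition as the difference of the two layers' SAE codes."""
    z_lo = sae_encode(h_lo, *sae_lo)
    z_hi = sae_encode(h_hi, *sae_hi)
    return z_hi - z_lo


class DiagonalGaussianDetector:
    """Lightweight scorer: per-feature mean/std fitted on in-domain transitions."""

    def fit(self, feats):
        self.mean = feats.mean(axis=0)
        self.std = feats.std(axis=0) + 1e-6  # avoid division by zero
        return self

    def score(self, feats):
        # Root-mean-square z-score; larger means "less like the in-domain data".
        z = (feats - self.mean) / self.std
        return np.sqrt((z ** 2).mean(axis=1))


def auroc(scores_in, scores_ood):
    """Probability that an OOD score exceeds an in-domain score (ties count half)."""
    diff = scores_ood[:, None] - scores_in[None, :]
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())


# --- Synthetic stand-ins for model internals (all assumptions) ---
rng = np.random.default_rng(0)
d_model, d_sae = 16, 64


def random_sae():
    # Random weights standing in for a trained SAE encoder.
    w = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
    return w, np.full(d_sae, -0.5)


sae_lo, sae_hi = random_sae(), random_sae()
layer_map = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)


def fake_activations(n, mean):
    """Fake hidden states at two layers; `mean` plays the role of the domain."""
    h_lo = rng.normal(loc=mean, size=(n, d_model))
    h_hi = np.tanh(h_lo @ layer_map) + 0.1 * rng.normal(size=(n, d_model))
    return h_lo, h_hi


train_feats = transition_features(*fake_activations(500, 0.0), sae_lo, sae_hi)
in_feats = transition_features(*fake_activations(200, 0.0), sae_lo, sae_hi)
ood_feats = transition_features(*fake_activations(200, 2.0), sae_lo, sae_hi)

detector = DiagonalGaussianDetector().fit(train_feats)
in_scores = detector.score(in_feats)
ood_scores = detector.score(ood_feats)
print("AUROC on synthetic data:", auroc(in_scores, ood_scores))
```

With a real model, the synthetic arrays would be replaced by hidden states captured at two layers and by encoders from SAEs trained on those layers. The detector is fitted on in-domain text only, which is one plausible reading of "lightweight" here rather than a description of the paper's learners.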

Elias Shaheen, Avi Mendelson • 2026

Related benchmarks

Task                               Dataset                     Result               Rank
Near-OOD Detection                 AGNews one-vs-all runs      AUROC (c3): 0.9      4
Near-OOD Detection                 ROSTD                       AUROC: 95.32         4
Near-OOD Detection                 SNIPS                       AUROC: 96.99         4
Near-OOD Detection                 CLINC150                    AUROC: 83.64         4
Out-of-Distribution Detection      20NG (ID) vs SST-2 (OOD)    AUROC: 0.9866        4
Out-of-Distribution Detection      20NG (ID) vs MNLI (OOD)     AUROC: 0.9763        4
Out-of-Distribution Detection      20NG (ID) vs RTE (OOD)      AUROC: 0.9884        4
Out-of-Distribution Detection      20NG (ID) vs IMDB (OOD)     AUROC: 95.4          4