
MechELK: A Mechanistic Interpretability Framework for Eliciting Latent Knowledge in Large Language Models

About

Large language models (LLMs) frequently encode factual and reasoning knowledge in their internal representations that is not faithfully reflected in their surface-level outputs, a phenomenon known as latent knowledge. Existing approaches to eliciting latent knowledge, such as Contrastive Consistency Search (CCS), rely on contrastive activation patterns and struggle with complex multi-step reasoning tasks, while mechanistic interpretability tools have primarily been used to understand model behavior rather than to extract hidden knowledge. We present MechELK, a unified three-stage framework that bridges mechanistic interpretability and latent knowledge elicitation. MechELK operates through: (1) Locate: using Sparse Autoencoder (SAE) feature analysis and activation patching to identify knowledge-bearing representations; (2) Verify: employing causal probing to distinguish genuine latent knowledge from spurious correlations; and (3) Elicit: applying representation engineering to surface hidden knowledge without modifying model weights. Evaluated on TruthfulQA, a curated Deceptive Alignment benchmark, and the Quirky LM dataset, MechELK achieves an average elicitation accuracy of 84.7%, outperforming CCS by 6.2% and direct linear probing by 9.1%. Crucially, MechELK successfully identifies latent knowledge in 78.3% of cases where the model's surface output is incorrect or evasive, demonstrating its utility for AI safety applications including deceptive alignment detection.
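The Locate, Verify, and Elicit stages can be illustrated with a toy sketch. This is not the authors' implementation: it assumes synthetic activation vectors with a planted "knowledge" axis, and it swaps in a difference-of-means direction for SAE feature analysis and activation patching, a projection ablation for causal probing, and additive steering for representation engineering. All function names (`make_activations`, `locate`, `ablate`, `steer`) are hypothetical.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def make_activations(n, dim, seed):
    """Synthetic stand-in for hidden states: the label is planted along axis 0."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        h = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        h[0] += 4.0 if label == 1 else -4.0  # assumed knowledge direction
        data.append((h, label))
    return data

def locate(data):
    """Locate (stand-in): unit difference-of-means direction plus a midpoint threshold."""
    pos = mean([h for h, y in data if y == 1])
    neg = mean([h for h, y in data if y == 0])
    diff = [p - q for p, q in zip(pos, neg)]
    norm = dot(diff, diff) ** 0.5
    direction = [x / norm for x in diff]
    midpoint = [(p + q) / 2 for p, q in zip(pos, neg)]
    return direction, dot(midpoint, direction)

def ablate(h, direction):
    """Verify (stand-in): remove the component of h along the unit direction."""
    c = dot(h, direction)
    return [x - c * d for x, d in zip(h, direction)]

def steer(h, direction, alpha):
    """Elicit (stand-in): shift the activation along the direction; no weights change."""
    return [x + alpha * d for x, d in zip(h, direction)]

def accuracy(data, direction, threshold):
    hits = sum((dot(h, direction) > threshold) == (y == 1) for h, y in data)
    return hits / len(data)

train = make_activations(400, 16, seed=0)
test = make_activations(400, 16, seed=1)

direction, threshold = locate(train)
clean_acc = accuracy(test, direction, threshold)

# If the direction really carries the label, removing it should push the
# readout toward chance; a spurious direction would leave accuracy intact.
ablated = [(ablate(h, direction), y) for h, y in test]
ablated_acc = accuracy(ablated, direction, threshold)

print(f"clean accuracy:   {clean_acc:.3f}")
print(f"ablated accuracy: {ablated_acc:.3f}")
```

In this toy setting the gap between the clean and ablated accuracy plays the role of the causal check. A real pipeline would read activations from a model's residual stream rather than sampling them.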

Ji-jun Park, Soo-joon Choi, Jiwon Jeong, Taeyang Yoon, Ju-Wan Lee • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Latent Knowledge Elicitation | TruthfulQA MC1 | Elicitation Accuracy: 86.7 | 12 |
| Latent Knowledge Elicitation | Quirky LM (1,200 factual questions) | Elicitation Accuracy: 0.874 | 12 |
| Latent Knowledge Elicitation | Deceptive Alignment Benchmark (DAB) (400 scenarios) | Elicitation Accuracy: 81.2 | 12 |
| Knowledge Elicitation | TruthfulQA (Llama-3-8B) | DR: 91.4 | 6 |
