Universal Refusal Circuits Across LLMs: Cross-Model Transfer via Trajectory Replay and Concept-Basis Reconstruction

About

Refusal behavior in aligned LLMs is often viewed as model-specific, yet we hypothesize it stems from a universal, low-dimensional semantic circuit shared across models. To test this, we introduce Trajectory Replay via Concept-Basis Reconstruction, a framework that transfers refusal interventions from donor to target models, spanning diverse architectures (e.g., Dense to MoE) and training regimes, without using target-side refusal supervision. By aligning layers via concept fingerprints and reconstructing refusal directions using a shared "recipe" of concept atoms, we map the donor's ablation trajectory into the target's semantic space. To preserve capabilities, we introduce a weight-SVD stability guard that projects interventions away from high-variance weight subspaces to prevent collateral damage. Our evaluation across 8 model pairs confirms that these transferred recipes consistently attenuate refusal while maintaining performance, providing strong evidence for the semantic universality of safety alignment.

Tony Cristofano • 2026
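
The abstract names three mechanical steps: reconstructing a refusal direction from a shared recipe of concept atoms, guarding it against high-variance weight subspaces, and ablating it from activations. The NumPy sketch below is one plausible reading of those steps, not the paper's implementation. It assumes that the recipe is a least-squares coefficient vector over concept atoms, that the guard removes the components along the top right singular vectors of a weight matrix, and that ablation is a plain orthogonal projection. All function names, shapes, and the random data are hypothetical.

```python
import numpy as np


def recipe_coefficients(donor_atoms, donor_direction):
    """Express the donor refusal direction as a mix of donor concept atoms.

    donor_atoms has shape (k, d_donor), one concept atom per row (assumed layout).
    """
    coeffs, *_ = np.linalg.lstsq(donor_atoms.T, donor_direction, rcond=None)
    return coeffs


def reconstruct_direction(target_atoms, coeffs):
    """Apply the same coefficients to the target model's atoms, then normalize."""
    direction = target_atoms.T @ coeffs
    return direction / np.linalg.norm(direction)


def svd_guard(direction, weight, top_k):
    """Remove the components of `direction` that lie in the span of the
    top_k right singular vectors of `weight` (assumed form of the guard)."""
    _, _, vt = np.linalg.svd(weight, full_matrices=False)
    protected = vt[:top_k]  # orthonormal rows, shape (top_k, d_in)
    guarded = direction - protected.T @ (protected @ direction)
    return guarded / np.linalg.norm(guarded)


def ablate(hidden, direction):
    """Project each row of `hidden` onto the complement of a unit direction."""
    return hidden - np.outer(hidden @ direction, direction)


# Toy data only: random atoms stand in for real concept fingerprints.
rng = np.random.default_rng(0)
k, d_donor, d_target = 6, 32, 48

donor_atoms = rng.normal(size=(k, d_donor))
target_atoms = rng.normal(size=(k, d_target))
true_coeffs = rng.normal(size=k)
donor_direction = donor_atoms.T @ true_coeffs

coeffs = recipe_coefficients(donor_atoms, donor_direction)
target_direction = reconstruct_direction(target_atoms, coeffs)

weight = rng.normal(size=(d_target, d_target))  # stand-in for a target-layer weight
guarded = svd_guard(target_direction, weight, top_k=4)

hidden = rng.normal(size=(5, d_target))  # stand-in for target activations
ablated = ablate(hidden, guarded)
```

In this toy setup the guarded direction is orthogonal to the protected singular vectors, and the ablated activations have no remaining component along it. Whether the guard should act on right or left singular vectors, and how many to protect, depends on details the abstract does not give.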

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K | EM 81.5 | 115 |
| Code Generation | MBPP | MBPP Pass@1 65.1 | 16 |
| Safety Alignment Evaluation | Refusal Evaluation Dataset | Refusal Rate 14 | 16 |