SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

About

Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges insufficiently addressed by techniques developed for dense models. In this work, the MoE-specific safety risk of positional vulnerability (the reliance of safety-aligned behaviors on specific expert modules) is formalized and systematically analyzed. An analytical framework, SAFEx, is presented to robustly identify, characterize, and validate safety-critical experts via a stability-based expert selection procedure, and to decompose them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in identifying and recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in controlling and enforcing model behaviors to generate appropriate safety responses. Expert-level interventions are conducted to probe causality and to test mitigation. Targeted masking of SAFEx-selected experts reveals that safety behavior is highly concentrated. On Qwen3-30B-A3B, configured with 48 MoE-FFN layers and 128 experts per layer under top-8 routing (48 × 128 = 6,144 experts in total), disabling 12 selected experts reduces the refusal rate by 22%. In addition, lightweight adaptation is performed using LoRA under three configurations (the HRCG, the union of HCDG and HRCG, and all experts), and the resulting updates are composed through negative weight merging targeted at the HRCG, leading to improved refusal under adversarial prompts without full-model retraining. These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures.
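To make the scale of the masking experiment concrete, the following is a minimal sketch of the expert-count arithmetic and of what disabling experts under top-k routing could look like. It is not the SAFEx implementation: the function names (`total_experts`, `masked_fraction`, `route_top_k`) are hypothetical, and it assumes that masking means excluding an expert from selection and renormalizing the softmax over the remaining top-k experts, which the abstract does not specify.

```python
import math

# Configuration figures quoted in the abstract for Qwen3-30B-A3B.
NUM_LAYERS = 48          # MoE-FFN layers
EXPERTS_PER_LAYER = 128  # experts per layer
TOP_K = 8                # experts activated per token


def total_experts(num_layers=NUM_LAYERS, experts_per_layer=EXPERTS_PER_LAYER):
    """Total number of routed experts across all MoE layers."""
    return num_layers * experts_per_layer


def masked_fraction(num_masked, total):
    """Share of all experts that a masking intervention disables."""
    return num_masked / total


def route_top_k(logits, k, masked=frozenset()):
    """Pick the k highest-scoring unmasked experts for one token in one layer.

    Assumption (not from the source): a masked expert is simply removed
    from the candidate set, and routing weights are a softmax over the
    logits of the k selected experts.
    Returns a dict mapping expert index to routing weight.
    """
    candidates = [(score, idx) for idx, score in enumerate(logits)
                  if idx not in masked]
    if len(candidates) < k:
        raise ValueError("fewer than k unmasked experts remain")
    # Sort by descending score; ties are broken by the lower expert index.
    top = sorted(candidates, key=lambda pair: (-pair[0], pair[1]))[:k]
    peak = top[0][0]  # subtract the maximum for numerical stability
    exps = [(idx, math.exp(score - peak)) for score, idx in top]
    norm = sum(weight for _, weight in exps)
    return {idx: weight / norm for idx, weight in exps}


if __name__ == "__main__":
    total = total_experts()
    print(total, masked_fraction(12, total))
    # Hypothetical router logits for a single token.
    logits = [0.01 * ((37 * i) % 101) for i in range(EXPERTS_PER_LAYER)]
    baseline = route_top_k(logits, TOP_K)
    strongest = max(baseline, key=baseline.get)
    print(sorted(baseline), sorted(route_top_k(logits, TOP_K, {strongest})))
```

Under these assumptions, the 12 disabled experts are 12 / 6,144 = 1/512 of all experts, roughly 0.2%, and masking an expert causes the next-highest-scoring expert in that layer to be routed in its place.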

Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li • 2025

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	AdvBench	AASR 48	271
Jailbreak Attack	JailbreakBench	ASR@107	132
Scientific Question Answering	GPQA Diamond	Accuracy 44.44	131
Mathematical Reasoning	MATH500	Accuracy (%) 92.2	56
Adversarial Attack	16 malicious prompts	ASR 26.8	40
Attack Success Rate	20,000 harmful requests and 20,000 jailbreak prompts (test)	Attack Success Rate (ASR) 48.8	18
Safety Alignment Evaluation	WildJailbreak (WildJB)	Safety Rate 96.35	14
Natural Language Understanding	MMLU	Accuracy (MMLU) 78.37	14
Safety Alignment Evaluation	StrongReject SR-Pair	Safety Rate 86.58	14
Safety Alignment Evaluation	StrongReject SR-PAP_M	Safety Rate 99.68	14

Showing 10 of 14 rows
