SAFEx: Analyzing Vulnerabilities of MoE-Based LLMs via Stable Safety-critical Expert Identification

About

Large language models with Mixture-of-Experts (MoE) architectures achieve efficiency and scalability, yet their routing mechanisms introduce safety alignment challenges that techniques developed for dense models do not adequately address. This work formalizes and systematically analyzes positional vulnerability, an MoE-specific safety risk in which safety-aligned behaviors rely on specific expert modules. It presents SAFEx, an analytical framework that robustly identifies, characterizes, and validates safety-critical experts via a stability-based expert selection procedure and decomposes them into two functional groups: the Harmful Content Detection Group (HCDG), which specializes in identifying and recognizing harmful content within user inputs, and the Harmful Response Control Group (HRCG), which specializes in controlling and enforcing model behaviors to generate appropriate safety responses. Expert-level interventions are conducted to probe causality and to test mitigation. Targeted masking of SAFEx-selected experts reveals that safety behavior is highly concentrated: on Qwen3-30B-A3B, configured with 48 MoE-FFN layers and 128 experts per layer under top-8 routing (48 × 128 = 6,144 experts in total), disabling 12 selected experts reduces the refusal rate by 22%. In addition, lightweight adaptation is performed using LoRA under three configurations (the HRCG, the union of HCDG and HRCG, and all experts), and the resulting updates are composed through negative weight merging targeted at the HRCG, which improves refusal under adversarial prompts without full-model retraining. These results establish positional vulnerability as a distinct MoE-specific safety challenge and provide a practical, compute-efficient pathway for expert-level safety interventions within routed architectures.

Zhenglin Lai, Mengyao Liao, Bingzhe Wu, Dong Xu, Zebin Zhao, Zhihang Yuan, Chao Fan, Jianqiang Li • 2025
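
The expert-masking intervention described in the abstract can be illustrated with a toy router. This is a minimal sketch, not the authors' implementation: the layer, expert, and top-k counts come from the abstract, while the function `route_top_k`, the toy logits, the masked expert indices, and the choice to renormalize with a softmax over the selected logits are all assumptions made for illustration.

```python
import math

NUM_LAYERS = 48          # MoE-FFN layers (from the abstract)
EXPERTS_PER_LAYER = 128  # experts per layer (from the abstract)
TOP_K = 8                # experts activated per token (from the abstract)

TOTAL_EXPERTS = NUM_LAYERS * EXPERTS_PER_LAYER  # 48 * 128 = 6144


def route_top_k(logits, k=TOP_K, masked=frozenset()):
    """Return {expert_index: weight} for one token in one layer.

    Hypothetical helper. Experts in `masked` are excluded before top-k
    selection, so the router falls back to the next-best experts.
    Weights are a softmax over the selected logits only (an assumed
    convention, not taken from the paper).
    """
    candidates = [i for i in range(len(logits)) if i not in masked]
    chosen = sorted(candidates, key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract max for stability
    exp = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exp.values())
    return {i: v / total for i, v in exp.items()}


# Toy router logits for a single token: expert i gets score i / 10.
logits = [i / 10 for i in range(EXPERTS_PER_LAYER)]

baseline = route_top_k(logits)
# Indices 126 and 127 stand in for "selected" experts; they are made up.
ablated = route_top_k(logits, masked={126, 127})

print(sorted(baseline))
print(sorted(ablated))
```

In this toy setting the masked experts drop out of the routed set and the next two highest-scoring experts take their place. For scale, the 12 experts disabled in the reported experiment are about 0.2% of the 6,144 experts in the model.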

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreak Attack | AdvBench | AASR | 48 | 271 |
| Jailbreak Attack | JailbreakBench | ASR@10 | 7 | 132 |
| Scientific Question Answering | GPQA Diamond | Accuracy | 44.44 | 123 |
| Mathematical Reasoning | MATH500 | Accuracy (%) | 92.2 | 47 |
| Adversarial Attack | 16 malicious prompts | ASR | 26.8 | 40 |
| Attack Success Rate | 20,000 harmful requests and 20,000 jailbreak prompts (test) | Attack Success Rate (ASR) | 48.8 | 18 |
| Safety Alignment Evaluation | WildJailbreak (WildJB) | Safety Rate | 96.35 | 14 |
| Natural Language Understanding | MMLU | Accuracy (MMLU) | 78.37 | 14 |
| Safety Alignment Evaluation | StrongReject SR-Pair | Safety Rate | 86.58 | 14 |
| Safety Alignment Evaluation | StrongReject SR-PAP_M | Safety Rate | 99.68 | 14 |
Showing 10 of 14 rows
