RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

About

Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations. Prompt-based jailbreaks rely on heuristic search and transfer poorly, model intervention methods require privileged access to internal representations, and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models because of the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on this observation, RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time, the optimized suffix is appended to a malicious prompt, requiring only input access. Across seven MoE LLMs, RouteHijack achieves a 69.3% average attack success rate (ASR), outperforming the prior optimization-based attack by 3.2×. RouteHijack also transfers zero-shot across five sibling MoE variants, raising average ASR from 27.7% to 61.2%, and further generalizes to three MoE-based VLMs, increasing average ASR from 2.47% to 38.7%. These findings expose a fundamental vulnerability in sparse expert architectures and highlight the need for defenses beyond output-level alignment.
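For readers unfamiliar with why routing is described as non-differentiable, the following is a minimal sketch of generic top-k expert selection for a single token. It is not taken from the paper: the function names (`softmax`, `top_k_route`) and the logit values are illustrative assumptions, and real MoE models may differ in details such as where normalization happens or whether shared experts exist. It only shows that the set of selected experts is a discrete choice, so it stays constant under small changes to the router logits and switches abruptly at a threshold.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Generic top-k routing for one token (illustrative, not the paper's code).

    Returns the indices of the k highest-probability experts and their
    weights renormalized to sum to 1.
    """
    probs = softmax(router_logits)
    # Discrete step: rank experts by probability and keep only the top k.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    kept_mass = sum(probs[i] for i in chosen)
    weights = [probs[i] / kept_mass for i in chosen]
    return chosen, weights


# Hypothetical router logits for four experts.
base = [2.0, 1.9, -1.0, 0.5]
experts_base, weights_base = top_k_route(base, k=2)

# A small change to expert 3's logit leaves the selected set unchanged...
small_change = [2.0, 1.9, -1.0, 0.6]
experts_small, _ = top_k_route(small_change, k=2)

# ...while a change that crosses expert 1's logit swaps the selection outright.
large_change = [2.0, 1.9, -1.0, 1.95]
experts_large, _ = top_k_route(large_change, k=2)

print(experts_base, experts_small, experts_large)
```

The weights of the selected experts vary smoothly with the logits, but the selection itself is piecewise constant, which is the property the abstract refers to.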

Zhiyuan Xu, Joseph Gardiner, Sana Belguith, Lichao Wu • 2026

Related benchmarks

Task	Dataset	Result
Adversarial Attack	16 malicious prompts	ASR: 34.7	40
Adversarial Attack	VLM Safety Evaluation Dataset	ASR: 49.2	8
Zero-shot transfer attack	Reasoning	Attack Success Rate (ASR): 27.8	4
Zero-shot transfer attack	Human preference	Attack Success Rate (ASR): 41.7	2
Zero-shot transfer attack	Code	Attack Success Rate (ASR): 95.9	2
Zero-shot transfer attack	General Knowledge	Attack Success Rate (Zero-shot): 83.3	2
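The results above are reported as attack success rate (ASR). As a minimal sketch of how such a percentage and its average across models are typically computed, assuming each attempt has already been judged as a success or failure: the helper names (`attack_success_rate`, `mean_asr`), the model labels, and the outcome lists below are hypothetical and are not data from the paper, which also does not specify its judging procedure in this excerpt.

```python
def attack_success_rate(outcomes):
    """Percentage of attempts judged successful.

    `outcomes` is a list of booleans, one per attempted prompt.
    """
    if not outcomes:
        raise ValueError("need at least one outcome")
    successes = sum(1 for judged_success in outcomes if judged_success)
    return 100.0 * successes / len(outcomes)


def mean_asr(per_model_asr):
    """Unweighted average of per-model ASR values (in percent)."""
    if not per_model_asr:
        raise ValueError("need at least one model")
    return sum(per_model_asr.values()) / len(per_model_asr)


# Hypothetical judged outcomes for two placeholder models.
judged = {
    "model_a": [True, False, True, True],
    "model_b": [True, False, False, False],
}
per_model = {name: attack_success_rate(o) for name, o in judged.items()}
average = mean_asr(per_model)
print(per_model, average)
```

An unweighted mean treats every model equally regardless of how many prompts it was evaluated on; whether the reported averages are weighted is not stated here.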

