
RouteHijack: Routing-Aware Attack on Mixture-of-Experts LLMs

About

Safety alignment is critical for the responsible deployment of large language models (LLMs). As Mixture-of-Experts (MoE) architectures are increasingly adopted to scale model capacity, understanding their safety robustness becomes essential. Existing adversarial attacks, however, have notable limitations: prompt-based jailbreaks rely on heuristic search and transfer poorly; model intervention methods require privileged access to internal representations; and optimization-based input attacks remain output-centric and are fundamentally limited on MoE models because of the non-differentiable routing mechanism. In this paper, we present RouteHijack, a routing-aware jailbreak for MoE LLMs. Our key insight is that safety behavior is concentrated in a small subset of experts, creating an opportunity to steer model behavior by influencing routing decisions through input optimization. Building on this observation, RouteHijack first performs response-driven expert localization to identify safety-critical and harmful experts by contrasting activations under safe refusals and harmful completions. It then constructs adversarial suffixes with a routing-aware objective that suppresses safety experts, promotes harmful experts, and prevents early-stage refusal during generation. At inference time, the optimized suffix is appended to a malicious prompt, requiring only input access. Across seven MoE LLMs, RouteHijack achieves a 69.3% average attack success rate (ASR), outperforming the prior optimization-based attack by 3.2×. RouteHijack also transfers zero-shot across five sibling MoE variants, raising average ASR from 27.7% to 61.2%, and further generalizes to three MoE-based VLMs, increasing average ASR from 2.47% to 38.7%. These findings expose a fundamental vulnerability in sparse expert architectures and highlight the need for defenses beyond output-level alignment.
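The expert-localization step described above, contrasting expert activations under two kinds of responses, can be illustrated with a minimal sketch on synthetic data. It assumes the router's per-token top-k expert selections have already been logged as integer arrays. The function names, the frequency-difference score, and the toy routing arrays are hypothetical illustrations and are not taken from the paper, whose actual scoring may differ.

```python
import numpy as np

def activation_frequency(selected_experts, num_experts):
    """Fraction of tokens routed to each expert.

    selected_experts: int array of shape (num_tokens, top_k) with the expert
    indices the router picked for each token (assumed logging format).
    """
    counts = np.bincount(selected_experts.ravel(), minlength=num_experts)
    return counts / selected_experts.shape[0]

def contrast_experts(refusal_routing, completion_routing, num_experts, top_n=1):
    """Rank experts by the gap in activation frequency between two response types."""
    diff = (activation_frequency(refusal_routing, num_experts)
            - activation_frequency(completion_routing, num_experts))
    order = np.argsort(diff)  # ascending: most completion-leaning first
    refusal_leaning = order[::-1][:top_n].tolist()
    completion_leaning = order[:top_n].tolist()
    return refusal_leaning, completion_leaning, diff

# Synthetic routing logs: 4 tokens, top-2 routing, 4 experts (toy values).
refusal_routing = np.array([[0, 1], [0, 2], [0, 1], [0, 3]])
completion_routing = np.array([[2, 3], [1, 3], [2, 3], [3, 0]])

refusal_leaning, completion_leaning, diff = contrast_experts(
    refusal_routing, completion_routing, num_experts=4, top_n=1
)
print(refusal_leaning, completion_leaning, diff)
```

In this toy data, expert 0 fires on every refusal token and expert 3 on every completion token, so they come out as the most strongly contrasted experts in each direction. The same kind of contrast is also what a defender would inspect when auditing where safety behavior concentrates in a sparse model.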

Zhiyuan Xu, Joseph Gardiner, Sana Belguith, Lichao Wu • 2026

Related benchmarks

Task                       Dataset                        Metric                           Result  Rank
Adversarial Attack         16 malicious prompts           ASR                              34.7    40
Adversarial Attack         VLM Safety Evaluation Dataset  ASR                              49.2    8
Zero-shot transfer attack  Reasoning                      Attack Success Rate (ASR)        27.8    4
Zero-shot transfer attack  Human preference               Attack Success Rate (ASR)        41.7    2
Zero-shot transfer attack  Code                           Attack Success Rate (ASR)        95.9    2
Zero-shot transfer attack  General Knowledge              Attack Success Rate (Zero-shot)  83.3    2
