Reasoning Hijacking: The Fragility of Reasoning Alignment in Large Language Models

About

Current LLM safety research predominantly focuses on mitigating Goal Hijacking, preventing attackers from redirecting a model's high-level objective (e.g., from "summarizing emails" to "phishing users"). In this paper, we argue that this perspective is incomplete and highlight a critical vulnerability in Reasoning Alignment. We expose the inherent fragility of current alignment techniques by proposing a new adversarial prompt attack paradigm: Reasoning Hijacking. To demonstrate this vulnerability, we instantiate it via the Criteria Attack, which subverts model judgments by injecting spurious decision criteria without altering the high-level task goal. Unlike Goal Hijacking, which attempts to override the system prompt, Reasoning Hijacking keeps the task goal intact but manipulates the model's decision-making logic by injecting spurious reasoning shortcuts. Through extensive experiments on three different tasks (toxic comment, negative review, and spam detection), we demonstrate that even state-of-the-art models are highly fragile, consistently prioritizing injected heuristic shortcuts over rigorous semantic analysis. Crucially, because the model's explicit intent remains aligned with the user's instructions, these attacks can bypass defenses designed to detect goal deviation (e.g., SecAlign, StruQ), revealing a fundamental blind spot in the current safety landscape. Data and code are available at https://github.com/Yuan-Hou/criteria_attack.

Yuansen Liu, Yixuan Tang, Anthony Kum Hoe Tun • 2026

Related benchmarks

Task                            Dataset          Result                   Rank
Negative Review Detection       Negative Review  ASR 56.1                 14
Toxic Comment Detection         Toxic Comment    ASR 47.3                 14
Spam Email Detection            Spam Email       ASR 58.9                 14
Prompt Injection                Toxic Comment    ASR (None) 89.9          10
Spam Email Detection            Spam Email       Token Count 201          10
Toxic Comment Classification    Toxic Comment    Average Tokens 201       10
Prompt Injection                Spam Email       ASR (None Defense) 59.2  10
Negative Review Classification  Negative Review  Tokens Used 54.7         10
Prompt Injection                Negative Review  ASR (None Defense) 35.3  10
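
The table reports ASR without defining it. The sketch below shows one common way such a number could be computed, assuming ASR stands for attack success rate and is measured as the percentage of inputs the model classifies correctly without the attack but misclassifies once the attack is applied. Both that definition and the function name `attack_success_rate` are assumptions made for illustration; they are not taken from the paper or its repository.

```python
from typing import Sequence


def attack_success_rate(
    true_labels: Sequence[int],
    clean_preds: Sequence[int],
    attacked_preds: Sequence[int],
) -> float:
    """Return a hypothetical ASR as a percentage.

    Assumption: only inputs the model labels correctly without the attack
    are eligible, and the attack succeeds when the attacked prediction no
    longer matches the true label.
    """
    if not (len(true_labels) == len(clean_preds) == len(attacked_preds)):
        raise ValueError("all three sequences must have the same length")

    eligible = 0  # inputs classified correctly before the attack
    flipped = 0   # eligible inputs misclassified after the attack
    for truth, clean, attacked in zip(true_labels, clean_preds, attacked_preds):
        if clean != truth:
            continue  # already wrong without the attack, so not counted
        eligible += 1
        if attacked != truth:
            flipped += 1

    if eligible == 0:
        raise ValueError("no correctly classified inputs to attack")
    return 100.0 * flipped / eligible


# Toy data: 1 = positive class (e.g. spam), 0 = negative class.
truth = [1, 1, 1, 0]
clean = [1, 1, 0, 0]     # third input is already misclassified
attacked = [0, 1, 0, 1]  # first and fourth inputs flip under attack
print(f"ASR: {attack_success_rate(truth, clean, attacked):.1f}")
```

Excluding inputs that are already misclassified keeps the metric from crediting the attack with errors the model makes on its own; other conventions (for example, counting all inputs) would yield different values.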