DarkMind: Latent Chain-of-Thought Backdoor in Customized LLMs

About

With the rapid rise of personalized AI, customized large language models (LLMs) equipped with Chain-of-Thought (CoT) reasoning now power millions of AI agents. However, their complex reasoning processes introduce new and largely unexplored security vulnerabilities. We present DarkMind, a novel latent reasoning-level backdoor attack that targets customized LLMs by manipulating internal CoT steps without altering user queries. Unlike prior prompt-based attacks, DarkMind activates covertly within the reasoning chain via latent triggers, enabling adversarial behaviors without modifying input prompts or requiring access to model parameters. To achieve stealth and reliability, we propose dual trigger types (instant and retrospective) and integrate them within a unified embedding template that governs trigger-dependent activation, employ a stealth optimization algorithm to minimize semantic drift, and introduce an automated conversation starter for covert activation across domains. Comprehensive experiments on eight reasoning datasets spanning arithmetic, commonsense, and symbolic domains, using five LLMs, demonstrate that DarkMind consistently achieves high attack success rates. We further investigate defense strategies to mitigate these risks and reveal that reasoning-level backdoors represent a significant yet underexplored threat, underscoring the need for robust, reasoning-aware security mechanisms.

Zhen Guo, Shanghao Shi, Shamim Yazdani, Ning Zhang, Reza Tourani • 2025

Related benchmarks

Task	Dataset	Metric	Result
Commonsense Reasoning	StrategyQA	Accuracy	88	208
Algebraic Reasoning	AQUA	Accuracy	87.4	65
Formal Reasoning	ProofNet	ASR	84.1	4
Mathematical Reasoning	GSM8K	ASR	86.5	4
Mathematical Reasoning	AQUA	Attack Success Rate (ASR)	89.2	4
Multi-hop Reasoning	StrategyQA	ASR	79.6	4
Algebraic Reasoning	AQUA	PPL	31.9	3
Backdoor Detection	Reasoning-level Backdoor (test)	Scrutiny	32.5	3
Commonsense Reasoning	StrategyQA	Perplexity	33.1	3
Mathematical Reasoning	GSM8K	Perplexity (PPL)	34.2	3

Showing 10 of 11 rows
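
This page does not define the ASR and perplexity (PPL) metrics reported in the table. As a minimal sketch, assuming the common definitions (ASR as the percentage of triggered samples whose final answer equals the attacker's intended answer, and perplexity as the exponential of the mean negative log-likelihood per token), they could be computed as follows. The function names, the exact-match comparison, and the sample values are illustrative assumptions, not taken from the paper.

```python
import math
from typing import Sequence


def attack_success_rate(outputs: Sequence[str], targets: Sequence[str]) -> float:
    """Percentage of triggered samples whose answer equals the attacker's target.

    Assumption: answers are compared by exact string match after stripping
    whitespace; the paper may use a different matching rule.
    """
    if len(outputs) != len(targets):
        raise ValueError("outputs and targets must have the same length")
    if not outputs:
        raise ValueError("need at least one sample")
    hits = sum(o.strip() == t.strip() for o, t in zip(outputs, targets))
    return 100.0 * hits / len(outputs)


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Exponential of the mean negative natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical data: 3 of 4 triggered samples produce the target answer.
asr = attack_success_rate(["12", "7", "30", "5"], ["12", "9", "30", "5"])
print(asr)  # 75.0

# Hypothetical data: every token has probability 0.5, so PPL is about 2.
ppl = perplexity([math.log(0.5)] * 4)
print(round(ppl, 6))
```

Under these assumptions a higher ASR means the backdoor fires more reliably, while a lower perplexity on the manipulated reasoning text suggests it reads as more natural and is therefore harder to flag.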
