Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

About

Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35% to 53% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at https://github.com/rl-bandits-lab/RL-Backdoor.
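To make the "detection as planning" idea concrete, the following is a minimal toy sketch of a UCT-style tree search over action sequences that queries a black-box policy. It is not the authors' implementation: the hidden trigger `TRIGGER`, the helper `victim_fails`, the horizon, the action set, and the 0/1 reward are all hypothetical stand-ins chosen only for illustration.

```python
import math
import random

# Hypothetical toy setup (not from the paper): the backdoored policy
# misbehaves whenever the searcher's actions contain this subsequence.
TRIGGER = (2, 0, 1)
ACTIONS = (0, 1, 2)
HORIZON = 4


def victim_fails(seq):
    """Black-box query: True if the toy backdoored policy misbehaves on seq."""
    n = len(TRIGGER)
    return any(tuple(seq[i:i + n]) == TRIGGER for i in range(len(seq) - n + 1))


class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.visits = 0
        self.value = 0.0    # sum of rollout rewards seen through this node


def ucb(parent, child, c=1.4):
    """UCT score: mean reward plus an exploration bonus."""
    mean = child.value / child.visits
    return mean + c * math.sqrt(math.log(parent.visits) / child.visits)


def search(n_iter=500, seed=0):
    """Return the distinct trigger-activating sequences found by the search."""
    rng = random.Random(seed)
    root = Node()
    found = set()
    for _ in range(n_iter):
        node, path, seq = root, [root], []
        # Selection and expansion: descend by UCT, expand one untried action.
        while len(seq) < HORIZON:
            untried = [a for a in ACTIONS if a not in node.children]
            if untried:
                a = rng.choice(untried)
                node.children[a] = Node()
                node = node.children[a]
                seq.append(a)
                path.append(node)
                break
            a = max(ACTIONS, key=lambda act: ucb(node, node.children[act]))
            node = node.children[a]
            seq.append(a)
            path.append(node)
        # Rollout: complete the sequence with random actions, then query.
        rollout = seq + [rng.choice(ACTIONS) for _ in range(HORIZON - len(seq))]
        reward = 1.0 if victim_fails(rollout) else 0.0
        if reward:
            found.add(tuple(rollout))
        # Backup: propagate the reward along the visited path.
        for visited in path:
            visited.visits += 1
            visited.value += reward
    return sorted(found)


found = search()
print(f"{len(found)} trigger-activating sequence(s) found")
```

The searcher only ever calls `victim_fails`, which mirrors the black-box access assumption in the abstract. In a real environment the reward would come from observed policy behavior rather than a known trigger.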

Sze-Ann Chen, Zhi-Yi Chin, Kui-Yuan Chen, Chi-Yu Li, Ping-Chun Hsieh • 2026

Related benchmarks

Task	Dataset	Result
Backdoor Defense Performance	Atari Pong Clean Environment	Score 1	5
Backdoor Defense Performance	Atari Breakout Poisoned Environment	Performance Score 20.5	5
Backdoor Defense Performance	Atari Pong Poisoned Environment	Defense Score 0.95	5
Backdoor Defense Performance	Atari Breakout Clean Environment	Score 1	5
Trojan Detection	Ant	TDSR 94.6	4
Trojan Detection	Humanoid	TDSR 99.7	4
Trojan Detection	O-RAN Simulator Moderate Responsiveness	TDSR 90.6	4
Trojan Detection	O-RAN Simulator Minimal Responsiveness	TDSR 73.5	4
Trojan Detection	O-RAN Simulator High Responsiveness	TDSR 93.6	4

