
Plan2Cleanse: Test-Time Backdoor Defense via Monte-Carlo Planning in Deep Reinforcement Learning

About

Ensuring the security of reinforcement learning (RL) models is critical, particularly when they are trained by third parties and deployed in real-world systems. Attackers can implant backdoors into these models, causing them to behave normally under typical conditions, but execute malicious behaviors when specific triggers are activated. In this work, we propose Plan2Cleanse, a test-time detection and mitigation framework that adapts Monte Carlo Tree Search to efficiently identify and neutralize RL backdoor attacks without requiring model retraining. Our approach recasts backdoor detection as a planning problem, enabling systematic exploration of temporally extended trigger sequences while maintaining black-box access to the target policy. By leveraging the detection results, Plan2Cleanse can further achieve efficient mitigation through tree-search preventive replanning. We evaluated our method in competitive MuJoCo environments, simulated O-RAN wireless networks, and Atari games. Plan2Cleanse achieves substantial improvements, increasing trigger detection success rates by more than 61.4 percentage points in stealthy O-RAN scenarios and improving win rates from 35% to 53% in competitive Humanoid environments. These results demonstrate the effectiveness of our test-time defense approach and highlight the importance of proactive defenses against backdoor threats in RL deployments. Our implementation is publicly available at https://github.com/rl-bandits-lab/RL-Backdoor.
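To make the "backdoor detection as a planning problem" idea concrete, here is a minimal toy sketch of a UCT-style Monte Carlo tree search over opponent action sequences. It is not the authors' implementation: the action set `ACTIONS`, the horizon `HORIZON`, the hidden trigger `HIDDEN_TRIGGER`, and the shaped black-box function `victim_return` are all hypothetical stand-ins chosen for illustration. The search only queries `victim_return` and never reads the trigger directly, which mirrors the black-box setting described above.

```python
import math
import random

ACTIONS = (0, 1, 2)             # hypothetical opponent action set
HORIZON = 4                     # hypothetical trigger length
HIDDEN_TRIGGER = (2, 0, 1, 2)   # toy backdoor; the search never reads this


def victim_return(seq):
    """Black-box query: toy victim return, lower as more of the trigger is played."""
    matched = 0
    for played, needed in zip(seq, HIDDEN_TRIGGER):
        if played != needed:
            break
        matched += 1
    return 1.0 - matched / HORIZON


class Node:
    def __init__(self):
        self.children = {}   # action -> Node
        self.visits = 0
        self.total = 0.0     # sum of search rewards (1 - victim return)


def uct_child(node, c=1.4):
    """Return the child action with the highest UCB1 score."""
    def score(item):
        _, child = item
        mean = child.total / child.visits
        return mean + c * math.sqrt(math.log(node.visits) / child.visits)
    return max(node.children.items(), key=score)[0]


def search(budget=300, seed=0):
    """Return the lowest-return sequence observed and its victim return."""
    rng = random.Random(seed)
    root = Node()
    best_seq, best_ret = None, float("inf")
    for _ in range(budget):
        node, path, seq = root, [root], []
        # Selection: descend while the current node is fully expanded.
        while len(seq) < HORIZON and len(node.children) == len(ACTIONS):
            action = uct_child(node)
            node = node.children[action]
            path.append(node)
            seq.append(action)
        # Expansion: add one untried action below the selected node.
        if len(seq) < HORIZON:
            action = rng.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node()
            node = node.children[action]
            path.append(node)
            seq.append(action)
        # Rollout: complete the sequence with uniformly random actions.
        while len(seq) < HORIZON:
            seq.append(rng.choice(ACTIONS))
        ret = victim_return(seq)
        if ret < best_ret:
            best_seq, best_ret = tuple(seq), ret
        # Backpropagation: a low victim return is a high search reward.
        for visited in path:
            visited.visits += 1
            visited.total += 1.0 - ret
    return best_seq, best_ret
```

Calling `search()` returns the candidate trigger sequence with the lowest observed victim return, together with that return. Whether it recovers the hidden trigger depends on the budget and seed, and a real policy would not expose the partial-match signal this toy reward provides.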

Sze-Ann Chen, Zhi-Yi Chin, Kui-Yuan Chen, Chi-Yu Li, Ping-Chun Hsieh • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Backdoor Defense Performance | Atari Pong Clean Environment | Score: 1 | 5 |
| Backdoor Defense Performance | Atari Breakout Poisoned Environment | Performance Score: 20.5 | 5 |
| Backdoor Defense Performance | Atari Pong Poisoned Environment | Defense Score: 0.95 | 5 |
| Backdoor Defense Performance | Atari Breakout Clean Environment | Score: 1 | 5 |
| Trojan Detection | Ant | TDSR: 94.6 | 4 |
| Trojan Detection | Humanoid | TDSR: 99.7 | 4 |
| Trojan Detection | O-RAN Simulator Moderate Responsiveness | TDSR: 90.6 | 4 |
| Trojan Detection | O-RAN Simulator Minimal Responsiveness | TDSR: 73.5 | 4 |
| Trojan Detection | O-RAN Simulator High Responsiveness | TDSR: 93.6 | 4 |
