Provable Defense against Backdoor Policies in Reinforcement Learning
About
We propose a provable defense mechanism against backdoor policies in reinforcement learning under subspace trigger assumption. A backdoor policy is a security threat where an adversary publishes a seemingly well-behaved policy which in fact allows hidden triggers. During deployment, the adversary can modify observed states in a particular way to trigger unexpected actions and harm the agent. We assume the agent does not have the resources to re-train a good policy. Instead, our defense mechanism sanitizes the backdoor policy by projecting observed states to a 'safe subspace', estimated from a small number of interactions with a clean (non-triggered) environment. Our sanitized policy achieves $\epsilon$ approximate optimality in the presence of triggers, provided the number of clean interactions is $O\left(\frac{D}{(1-\gamma)^4 \epsilon^2}\right)$ where $\gamma$ is the discounting factor and $D$ is the dimension of state space. Empirically, we show that our sanitization defense performs well on two Atari game environments.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Single-agent Reinforcement Learning (SARL) | Breakout Poisoned | Average Episode Return220.7 | 9 | |
| Single-agent Reinforcement Learning (SARL) | SpaceInvaders Poisoned | Average Episode Return428.9 | 9 | |
| Single-agent Reinforcement Learning (SARL) | Breakout Clean | Average Episode Return212.2 | 9 | |
| Single-agent Reinforcement Learning (SARL) | SpaceInvaders Clean | Average Episode Return281.8 | 9 | |
| Backdoor Defense | Atari environments | Online Cost1.29 | 5 | |
| Backdoor Defense Performance | Atari Pong Poisoned Environment | Defense Score0.973 | 5 | |
| Backdoor Defense Performance | Atari Breakout Poisoned Environment | Performance Score21.46 | 5 | |
| Backdoor Defense Performance | Atari Breakout Clean Environment | Score21.46 | 5 | |
| Backdoor Defense Performance | Atari Pong Clean Environment | Score0.973 | 5 | |
| Single-agent Reinforcement Learning (SARL) | Seaquest Poisoned | Average Episode Return825.1 | 4 |