Efficient Symbolic Policy Learning with Differentiable Symbolic Expression
About
Deep reinforcement learning (DRL) has led to a wide range of advances in sequential decision-making tasks. However, the complexity of neural network policies makes them difficult to understand and to deploy under limited computational resources. Employing compact symbolic expressions as symbolic policies is a promising strategy for obtaining simple and interpretable policies. Previous symbolic policy methods usually involve complex training processes and pre-trained neural network policies, which are inefficient and limit the application of symbolic policies. In this paper, we propose an efficient gradient-based learning method named Efficient Symbolic Policy Learning (ESPL) that learns the symbolic policy from scratch in an end-to-end way. We introduce a symbolic network as the search space and employ a path selector to find the compact symbolic policy. By doing so we represent the policy with a differentiable symbolic expression and train it in an off-policy manner, which further improves efficiency. In addition, in contrast with previous symbolic policies, which only work in single-task RL because of their complexity, we extend ESPL to meta-RL to generate symbolic policies for unseen tasks. Experimentally, we show that our approach generates symbolic policies with higher performance and greatly improves data efficiency for single-task RL. In meta-RL, we demonstrate that, compared with neural network policies, the proposed symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.
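The sketch below illustrates the two ingredients named in the abstract: a differentiable symbolic network whose units apply fixed operators (identity, sine, product) to learned linear combinations of the observation, and a path selector of learnable gates that prunes connections so the surviving computation graph can be read off as a compact symbolic expression. This is a minimal PyTorch sketch under assumed design choices, not the authors' implementation; all class and variable names (e.g. `SymbolicLayer`, `gate_logits`, `sparsity_loss`) are illustrative placeholders.

```python
import torch
import torch.nn as nn


class SymbolicLayer(nn.Module):
    """One symbolic-network layer: fixed operators over gated linear maps of the input."""

    def __init__(self, in_dim: int):
        super().__init__()
        # One linear projection per unary operator, plus two for the product unit.
        self.lin_id = nn.Linear(in_dim, 1)
        self.lin_sin = nn.Linear(in_dim, 1)
        self.lin_mul_a = nn.Linear(in_dim, 1)
        self.lin_mul_b = nn.Linear(in_dim, 1)
        # Path-selector logits: one soft gate per input connection of each unit.
        self.gate_logits = nn.Parameter(torch.zeros(4, in_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = torch.sigmoid(self.gate_logits)             # soft gates in (0, 1)
        xi, xs, xa, xb = (x * g[i] for i in range(4))   # mask inputs per unit
        out_id = self.lin_id(xi)
        out_sin = torch.sin(self.lin_sin(xs))
        out_mul = self.lin_mul_a(xa) * self.lin_mul_b(xb)
        return torch.cat([out_id, out_sin, out_mul], dim=-1)


class SymbolicPolicy(nn.Module):
    """Stacks symbolic layers and maps the final features to a bounded action."""

    def __init__(self, obs_dim: int, act_dim: int, depth: int = 2):
        super().__init__()
        layers, dim = [], obs_dim
        for _ in range(depth):
            layers.append(SymbolicLayer(dim))
            dim = 3  # each layer emits [identity, sin, product] features
        self.layers = nn.ModuleList(layers)
        self.head = nn.Linear(dim, act_dim)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        h = obs
        for layer in self.layers:
            h = layer(h)
        return torch.tanh(self.head(h))

    def sparsity_loss(self) -> torch.Tensor:
        # Penalizing the gates pushes most paths toward zero, so the remaining
        # active paths form a short, human-readable symbolic expression.
        return sum(torch.sigmoid(l.gate_logits).sum() for l in self.layers)


# The whole policy is differentiable, so it can be trained with gradients
# from an off-policy actor loss; the term below is only a stand-in for one.
policy = SymbolicPolicy(obs_dim=3, act_dim=1)           # e.g. a Pendulum-sized task
action = policy(torch.randn(1, 3))
loss = -action.mean() + 1e-3 * policy.sparsity_loss()
loss.backward()
```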
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | Pendulum | Avg Episode Reward | -151.7 | 15 |
| Reinforcement Learning | Hopper | Avg Episode Reward | 2.44e+3 | 15 |
| Reinforcement Learning | MountainCar | Avg Episode Reward | 0.9402 | 14 |
| Reinforcement Learning | LunarLander | Average Episode Reward | 283.6 | 10 |
| Reinforcement Learning | Inverted Double Pendulum | Avg Episode Reward | 9.36e+3 | 10 |
| Reinforcement Learning | Reinforcement Learning Suite Aggregate | Worst Rank | 6 | 10 |
| Reinforcement Learning | BipedalWalker | Average Episode Reward | 309.4 | 10 |
| Reinforcement Learning | Inverted Pendulum Swingup | Avg Episode Reward | 890.4 | 10 |
| Autonomous Racing | TORCS G-Track | Lap Time (s) | 77.31 | 6 |
| Autonomous Racing | TORCS (AALBORG) | Lap Time (s) | 107.9 | 6 |