
Efficient Symbolic Policy Learning with Differentiable Symbolic Expression

About

Deep reinforcement learning (DRL) has led to a wide range of advances in sequential decision-making tasks. However, the complexity of neural network policies makes them difficult to interpret and to deploy on devices with limited computational resources. Representing policies as compact symbolic expressions is a promising strategy for obtaining simple and interpretable policies. Previous symbolic policy methods usually involve complex training processes and pre-trained neural network policies, which are inefficient and limit the application of symbolic policies. In this paper, we propose an efficient gradient-based learning method named Efficient Symbolic Policy Learning (ESPL) that learns the symbolic policy from scratch in an end-to-end way. We introduce a symbolic network as the search space and employ a path selector to find the compact symbolic policy. By doing so, we represent the policy with a differentiable symbolic expression and train it in an off-policy manner, which further improves efficiency. In addition, in contrast with previous symbolic policies, which owing to their complexity only work in single-task RL, we extend ESPL to meta-RL to generate symbolic policies for unseen tasks. Experimentally, we show that our approach generates symbolic policies with higher performance and greatly improves data efficiency for single-task RL. In meta-RL, we demonstrate that, compared with neural network policies, the proposed symbolic policy achieves higher performance and efficiency and shows the potential to be interpretable.
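To make the core idea concrete, below is a minimal sketch of a symbolic-network forward pass with a path selector. This is an illustration under stated assumptions, not the authors' implementation: the operator bank, gate encoding, and readout format are all hypothetical, and the gates are shown as fixed 0/1 masks rather than the learned, differentiable selectors the paper trains. The point is that once only a few gated paths survive, the layer can be read out directly as a compact symbolic expression.

```python
import numpy as np

# Bank of primitive operators applied inside one symbolic-network layer.
# (An illustrative choice; the actual operator set is an assumption here.)
OPS = [("id", lambda z: z), ("sin", np.sin), ("cos", np.cos), ("sq", np.square)]

def symbolic_layer(x, weights, gates):
    """x: (d,) input; weights: (n_ops, d); gates: (n_ops,) in {0, 1}.

    Each operator receives its own linear projection of the input; the
    path-selector gates zero out pruned operators.
    """
    pre = weights @ x  # one linear pre-activation per operator
    return np.array([g * op(z) for (_, op), z, g in zip(OPS, pre, gates)])

def readout(weights, gates, var_names):
    """Render the surviving (gate == 1) paths as a symbolic expression."""
    terms = []
    for (name, _), w_row, g in zip(OPS, weights, gates):
        if g == 0:
            continue  # pruned path contributes nothing to the expression
        arg = " + ".join(f"{w:.2f}*{v}" for w, v in zip(w_row, var_names) if w != 0)
        terms.append(arg if name == "id" else f"{name}({arg})")
    return " + ".join(terms)

x = np.array([0.5, -1.0])  # e.g. a 2-D state such as [angle, angular velocity]
weights = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0], [0.0, 0.0]])
gates = np.array([1, 1, 0, 0])  # selector keeps only the id and sin paths

features = symbolic_layer(x, weights, gates)
expr = readout(weights, gates, ["theta", "omega"])
print(features)
print(expr)  # -> "1.00*theta + sin(2.00*omega)"
```

In the actual method the gates and weights are optimized jointly by gradient descent (the gates via a differentiable relaxation), so the sparsification that yields the compact expression happens during training rather than by hand as above.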

Jiaming Guo, Rui Zhang, Shaohui Peng, Qi Yi, Xing Hu, Ruizhi Chen, Zidong Du, Xishan Zhang, Ling Li, Qi Guo, Yunji Chen • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reinforcement Learning | Pendulum | Avg Episode Reward | -151.7 | 15 |
| Reinforcement Learning | Hopper | Avg Episode Reward | 2.44e+3 | 15 |
| Reinforcement Learning | MountainCar | Avg Episode Reward | 0.9402 | 14 |
| Reinforcement Learning | LunarLander | Average Episode Reward | 283.6 | 10 |
| Reinforcement Learning | Inverted Double Pendulum | Avg Episode Reward | 9.36e+3 | 10 |
| Reinforcement Learning | Reinforcement Learning Suite Aggregate | Worst Rank | 6 | 10 |
| Reinforcement Learning | BipedalWalker | Average Episode Reward | 309.4 | 10 |
| Reinforcement Learning | Inverted Pendulum Swingup | Avg Episode Reward | 890.4 | 10 |
| Autonomous Racing | TORCS G-Track | Lap Time (s) | 77.31 | 6 |
| Autonomous Racing | TORCS (AALBORG) | Lap Time (s) | 107.9 | 6 |

Showing 10 of 43 rows
