Inverse Reinforcement Learning with the Average Reward Criterion

About

We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion. The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent. Previous IRL methods assume that the expert is trained in a discounted environment, and the discount factor is known. This work alleviates this assumption by proposing an average-reward framework with efficient learning algorithms. We develop novel stochastic first-order methods to solve the IRL problem under the average-reward setting, which requires solving an Average-reward Markov Decision Process (AMDP) as a subproblem. To solve the subproblem, we develop a Stochastic Policy Mirror Descent (SPMD) method under general state and action spaces that needs $\mathcal{O}(1/\varepsilon)$ steps of gradient computation. Equipped with SPMD, we propose the Inverse Policy Mirror Descent (IPMD) method for solving the IRL problem with a $\mathcal{O}(1/\varepsilon^2)$ complexity. To the best of our knowledge, the aforementioned complexity results are new in IRL. Finally, we corroborate our analysis with numerical experiments using the MuJoCo benchmark and additional control tasks.

Feiyang Wu, Jingyang Ke, Anqi Wu • 2023
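
To make the average-reward criterion and the mirror-descent idea in the abstract concrete, here is a minimal sketch on a toy tabular MDP. It is not the paper's SPMD or IPMD: it uses exact (non-stochastic) policy evaluation, a finite state and action space, and a KL-divergence mirror step, all of which are assumptions chosen for illustration. The function names, the two-state MDP, and the step size `eta` are hypothetical.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Stationary distribution d of an ergodic chain: d @ P_pi = d, sum(d) = 1."""
    n = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def evaluate(P, r, pi):
    """Return the average reward rho and differential Q-values of policy pi.

    P: (S, A, S) transition probabilities, r: (S, A) rewards, pi: (S, A) policy.
    """
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state chain under pi
    r_pi = (pi * r).sum(axis=1)             # expected one-step reward per state
    d = stationary_distribution(P_pi)
    rho = float(d @ r_pi)                   # long-run average reward
    # Bias h solves (I - P_pi) h = r_pi - rho, normalized so that d @ h = 0.
    S = P_pi.shape[0]
    A = np.vstack([np.eye(S) - P_pi, d[None, :]])
    b = np.concatenate([r_pi - rho, [0.0]])
    h, *_ = np.linalg.lstsq(A, b, rcond=None)
    q = r - rho + np.einsum("sat,t->sa", P, h)
    return rho, q

def pmd_step(pi, q, eta):
    """One KL mirror-descent (multiplicative-weights) step: pi_new ∝ pi * exp(eta * q)."""
    logits = np.log(np.clip(pi, 1e-300, None)) + eta * q
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    new = np.exp(logits)
    return new / new.sum(axis=1, keepdims=True)

# Hypothetical 2-state, 2-action MDP: action 0 mostly stays, action 1 mostly switches.
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.1, 0.9], [0.9, 0.1]]])
r = np.array([[0.0, 0.0],
              [1.0, 1.0]])                  # reward 1 only while in state 1
pi = np.full((2, 2), 0.5)                   # start from the uniform policy

rhos = []
for _ in range(50):
    rho, q = evaluate(P, r, pi)
    rhos.append(rho)
    pi = pmd_step(pi, q, eta=1.0)
rho_final, _ = evaluate(P, r, pi)
print(rhos[0], rho_final)
```

Note that no discount factor appears anywhere: the policy is scored by its long-run average reward under the stationary distribution, which is the quantity the average-reward setting replaces the discounted return with.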

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Reinforcement Learning | MuJoCo Half-Cheetah | Average Return: 1.30e+4 | 18 |
| Reinforcement Learning | MuJoCo Hopper | Average Return: 3.62e+3 | 14 |
| Reinforcement Learning | MuJoCo Walker | Average Return: 4.45e+3 | 14 |
| Reinforcement Learning | MuJoCo Ant | Average Return: 5.23e+3 | 14 |
| Inverse Reinforcement Learning | MuJoCo Walker (test) | Average Performance: 5.42e+3 | 4 |
| Inverse Reinforcement Learning | MuJoCo Humanoid (test) | Average Performance: 7.38e+3 | 4 |
| Inverse Reinforcement Learning | MuJoCo Hopper (test) | Average Performance: 3.56e+3 | 4 |
| Inverse Reinforcement Learning | MuJoCo Half-Cheetah (test) | Average Performance: 1.26e+4 | 4 |
| Inverse Reinforcement Learning | MuJoCo Ant (test) | Average Performance: 4.05e+3 | 4 |
| Reinforcement Learning | MuJoCo Humanoid | Average Return: 1.02e+4 | 2 |
