An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model
About
We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
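For reference, the PL condition mentioned above can be written in its standard generic form. This is a sketch in assumed notation that does not appear in the abstract ($f$ for a differentiable objective, $\theta$ for its parameters, $f^*$ for its minimum value, $\mu > 0$ for the PL constant). It is not the paper's specific result for the Bellman residual.

$$
\frac{1}{2}\,\lVert \nabla f(\theta) \rVert^2 \;\ge\; \mu \,\bigl( f(\theta) - f^* \bigr) \quad \text{for all } \theta .
$$

If $f$ additionally has an $L$-Lipschitz gradient (a further assumption made here for illustration), gradient descent with step size $1/L$, that is $\theta_{k+1} = \theta_k - \tfrac{1}{L} \nabla f(\theta_k)$, satisfies

$$
f(\theta_k) - f^* \;\le\; \Bigl( 1 - \frac{\mu}{L} \Bigr)^{k} \bigl( f(\theta_0) - f^* \bigr),
$$

so the optimality gap shrinks geometrically even though $f$ need not be convex. This is the sense in which a PL-type property can support fast global convergence guarantees.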
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reward Estimation | Standard bus engine replacement simulation without dummy variables | MAPE | 0.12 | 34 |
| Imitation Learning | CartPole v1 (test) | Optimality (%) | 100 | 15 |
| Imitation Learning | Acrobot v1 (test) | Optimality (%) | 103.7 | 15 |
| Imitation Learning | Lunar Lander v2 (test) | Optimality (%) | 107.3 | 15 |