An Empirical Risk Minimization Approach for Offline Inverse RL and Dynamic Discrete Choice Model

About

We study the problem of estimating Dynamic Discrete Choice (DDC) models, also known as offline Maximum Entropy-Regularized Inverse Reinforcement Learning (offline MaxEnt-IRL) in machine learning. The objective is to recover reward or $Q^*$ functions that govern agent behavior from offline behavior data. In this paper, we propose a globally convergent gradient-based method for solving these problems without the restrictive assumption of linearly parameterized rewards. The novelty of our approach lies in introducing the Empirical Risk Minimization (ERM) based IRL/DDC framework, which circumvents the need for explicit state transition probability estimation in the Bellman equation. Furthermore, our method is compatible with non-parametric estimation techniques such as neural networks. Therefore, the proposed method has the potential to be scaled to high-dimensional, infinite state spaces. A key theoretical insight underlying our approach is that the Bellman residual satisfies the Polyak-Lojasiewicz (PL) condition -- a property that, while weaker than strong convexity, is sufficient to ensure fast global convergence guarantees. Through a series of synthetic experiments, we demonstrate that our approach consistently outperforms benchmark methods and state-of-the-art alternatives.
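The claim that the PL condition is weaker than strong convexity yet still yields fast global convergence can be illustrated on a toy problem. The sketch below is not the paper's Bellman-residual objective: the matrix `A`, the vector `b`, and every function name are assumptions chosen for illustration. It uses an underdetermined least-squares loss, whose Hessian is singular (so the loss is not strongly convex) but which satisfies the PL inequality $\tfrac{1}{2}\|\nabla f(x)\|^2 \ge \mu\,(f(x) - f^*)$ with $\mu$ equal to the smallest nonzero eigenvalue of $A^\top A$. Gradient descent with step size $1/L$ then satisfies $f(x_k) - f^* \le (1 - \mu/L)^k\,(f(x_0) - f^*)$.

```python
import math

# Toy objective (an assumption for illustration, not the paper's loss):
# f(x) = 0.5 * ||A x - b||^2 with a 2x3 matrix A. The Hessian A^T A is
# singular, so f is NOT strongly convex, yet f satisfies the PL condition.
A = [[1.0, 1.0, 0.0],
     [0.0, 1.0, 2.0]]
b = [1.0, -1.0]


def residual(x):
    """Return A x - b."""
    return [sum(a * xj for a, xj in zip(row, x)) - bi for row, bi in zip(A, b)]


def f(x):
    """Least-squares loss; its minimum value is 0 because A has full row rank."""
    return 0.5 * sum(r * r for r in residual(x))


def grad(x):
    """Gradient A^T (A x - b)."""
    r = residual(x)
    return [sum(A[i][j] * r[i] for i in range(len(A))) for j in range(len(x))]


def gram_eigenvalues():
    """Eigenvalues of the 2x2 matrix A A^T, i.e. the nonzero eigenvalues of A^T A."""
    p = sum(v * v for v in A[0])
    q = sum(u * v for u, v in zip(A[0], A[1]))
    s = sum(v * v for v in A[1])
    mean = 0.5 * (p + s)
    radius = math.sqrt(0.25 * (p - s) ** 2 + q * q)
    return mean - radius, mean + radius  # (PL constant mu, smoothness constant L)


def run_gradient_descent(x0, steps):
    """Run gradient descent with step size 1/L and record the loss at each iterate."""
    mu, L = gram_eigenvalues()
    x = list(x0)
    values = [f(x)]
    for _ in range(steps):
        g = grad(x)
        x = [xi - gi / L for xi, gi in zip(x, g)]
        values.append(f(x))
    return mu, L, values


if __name__ == "__main__":
    mu, L, values = run_gradient_descent([3.0, -2.0, 5.0], 20)
    rate = 1.0 - mu / L
    for k in (0, 5, 10, 20):
        print(k, values[k], rate ** k * values[0])  # loss vs. PL upper bound
```

Running the script prints the loss next to the bound $(1 - \mu/L)^k f(x_0)$ at a few iterations; the loss should stay below the bound throughout, even though the objective has a flat direction.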

Enoch H. Kang, Hema Yoganarasimhan, Lalit Jain • 2025

Related benchmarks

Task	Dataset	Result
Reward Estimation	Standard bus engine replacement simulation without dummy variables	MAPE 0.12	34
Imitation Learning	CartPole v1 (test)	Optimality (%) 100	15
Imitation Learning	Acrobot v1 (test)	Optimality (%) 103.7	15
Imitation Learning	Lunar Lander v2 (test)	Optimality (%) 107.3	15

