Model-Based Reinforcement Learning in Discrete-Action Non-Markovian Reward Decision Processes
About
Many practical decision-making problems involve tasks whose success depends on the entire history of the system, rather than on reaching a state with desired properties. Markovian Reinforcement Learning (RL) approaches are unsuitable for such tasks, whereas RL with non-Markovian reward decision processes (NMRDPs) lets agents tackle tasks with temporal dependencies. However, NMRDP approaches have long lacked formal guarantees on both (near-)optimality and sample efficiency. We contribute to solving both issues with QR-MAX, a novel model-based algorithm for discrete NMRDPs that factorizes Markovian transition learning from non-Markovian reward handling via reward machines. To the best of our knowledge, this is the first model-based RL algorithm for discrete-action NMRDPs that exploits this factorization to obtain PAC convergence to $\varepsilon$-optimal policies with polynomial sample complexity. We then extend QR-MAX to continuous state spaces with Bucket-QR-MAX, a SimHash-based discretiser that preserves the same factorized structure and achieves fast and stable learning without manual gridding or function approximation. We empirically compare our methods with state-of-the-art model-based RL approaches on environments of increasing complexity, showing significantly better sample efficiency and greater robustness in finding optimal policies.
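The factorization described above can be made concrete with a minimal sketch. The class names, the event-label interface, and the R-MAX-style visit-count threshold below are illustrative assumptions, not the paper's implementation; the point is that the transition model is learned over environment states alone, while a reward machine carries the non-Markovian reward.

```python
# Illustrative sketch of the factorization behind QR-MAX (not the authors' code).
# The reward machine supplies the non-Markovian reward; the environment's
# transition model stays Markovian and is learned once, independently of the
# machine's state, in R-MAX style (optimistic until a visit-count threshold).

from collections import defaultdict

class RewardMachine:
    """Finite-state machine over high-level event labels.

    delta: (u, label) -> next machine state
    rho:   (u, label) -> scalar reward
    """
    def __init__(self, delta, rho, u0):
        self.delta, self.rho, self.u0 = delta, rho, u0

    def step(self, u, label):
        return self.delta.get((u, label), u), self.rho.get((u, label), 0.0)

class FactoredModel:
    """Tabular transition model over environment states only.

    Because transitions are Markovian, counts are shared across all
    reward-machine states: the model's sample complexity scales with
    |S| x |A| rather than |S| x |U| x |A|.
    """
    def __init__(self, m_known=10):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}
        self.m_known = m_known  # "known" threshold (assumed value)

    def update(self, s, a, s_next):
        self.counts[(s, a)][s_next] += 1

    def known(self, s, a):
        return sum(self.counts[(s, a)].values()) >= self.m_known

    def probs(self, s, a):
        n = sum(self.counts[(s, a)].values())
        return {s2: c / n for s2, c in self.counts[(s, a)].items()}
```

Planning then runs over product states $(s, u)$: transitions come from the learned Markovian model, rewards from the machine, and unvisited $(s, a)$ pairs receive the usual R-MAX optimistic value until they become known.

For the continuous case, Bucket-QR-MAX replaces raw states with SimHash buckets. A standard SimHash discretiser is sketched below; the projection width `k` and the NumPy-based implementation are assumptions, not details taken from the paper.

```python
import numpy as np

class SimHashDiscretiser:
    """Map continuous states to discrete buckets via random hyperplanes.

    Each of the k rows of A defines a hyperplane through the origin; the
    sign pattern of A @ x is a k-bit code, so nearby states tend to share
    a bucket. k trades resolution against the number of buckets (2^k).
    """
    def __init__(self, state_dim, k=16, seed=0):
        rng = np.random.default_rng(seed)
        self.A = rng.standard_normal((k, state_dim))

    def bucket(self, x):
        bits = (self.A @ np.asarray(x, dtype=float)) >= 0.0
        # Pack the k sign bits into one integer bucket id.
        return int("".join("1" if b else "0" for b in bits), 2)
```

The resulting integer bucket ids can stand in for states in the same tabular model, so the factorized Markovian/non-Markovian structure is preserved without manual gridding or function approximation.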
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Policy Optimization | Office World MAP0 | Avg Training Steps | 4.15e+3 | 18 |
| Policy Optimization | Office World MAP1 | Avg Training Steps | 3.13e+3 | 7 |
| Policy Optimization | Office World MAP4 | Average Training Steps | 5.63e+3 | 7 |
| Policy Optimization | Office World Map 1, Exp 5 | Average Training Steps | 3.13e+3 | 7 |
| Policy Optimization | Office World Map 4 Exp 6 | Average Training Steps | 5.63e+3 | 7 |
| Policy Optimization | Office World Map 2 Exp 5 | Average Training Steps | 3.77e+3 | 7 |
| Policy Optimization | Office World Map 3, Exp 5 | Average Training Steps | 5.81e+3 | 7 |
| Reinforcement Learning | Office World Map 1 | Training Steps | 7.31e+3 | 6 |
| Reinforcement Learning | Office World Map 2 | Training Steps | 1.62e+4 | 6 |
| Reinforcement Learning | Office World Map 3 | Steps to 100% Success | 2.52e+4 | 6 |