The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning

About

While distributional reinforcement learning (DistRL) has been empirically effective, the question of when and why it is better than vanilla, non-distributional RL has remained unanswered. This paper explains the benefits of DistRL through the lens of small-loss bounds, which are instance-dependent bounds that scale with optimal achievable cost. Particularly, our bounds converge much faster than those from non-distributional approaches if the optimal cost is small. As warmup, we propose a distributional contextual bandit (DistCB) algorithm, which we show enjoys small-loss regret bounds and empirically outperforms the state-of-the-art on three real-world tasks. In online RL, we propose a DistRL algorithm that constructs confidence sets using maximum likelihood estimation. We prove that our algorithm enjoys novel small-loss PAC bounds in low-rank MDPs. As part of our analysis, we introduce the $\ell_1$ distributional eluder dimension which may be of independent interest. Then, in offline RL, we show that pessimistic DistRL enjoys small-loss PAC bounds that are novel to the offline setting and are more robust to bad single-policy coverage.

Kaiwen Wang, Kevin Zhou, Runzhe Wu, Nathan Kallus, Wen Sun• 2023

Related benchmarks

Task	Dataset	Result
Contextual Bandit	King County Housing (All episodes)	Average Cost0.726	3
Contextual Bandit	King County Housing (Last 100 episodes)	Average Cost0.708	3
Contextual Bandit	Prudential Life Insurance (All episodes)	Average Cost0.411	3
Contextual Bandit	Prudential Life Insurance (Last 100 episodes)	Average Cost0.388	3
Contextual Bandit	CIFAR-100 (All episodes)	Average Cost0.838	3
Contextual Bandit	CIFAR-100 (Last 100 episodes)	Average Cost0.775	3

Showing 6 of 6 rows

Other info

Code

Follow for update

@wizwand_team Discord