Improved Regret Bound and Experience Replay in Regularized Policy Iteration

We study algorithms for learning in infinite-horizon undiscounted Markov Decision Processes (MDPs) with function approximation. We first show that the regret analysis of \textsc{Politex}, a version of regularized policy iteration, can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result is the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. Our analysis suggests that we need to approximate the average of the action-value functions of past policies well. Thus we propose an efficient implementation of Politex with neural networks, where we train a single $Q$-function on a replay buffer with past data. We show that it often leads to superior performance over other implementation choices, especially in terms of wall-clock time. This work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.

Authors' notes