Temporal Difference Uncertainties: An Epistemic Signal for Exploration

An effective approach to exploration in reinforcement learning is to rely on posterior sampling from the agent's beliefs over the optimal policy. Such methods are grounded in the agent's epistemic uncertainty and yield near-optimal exploration in tabular settings. However, prior works face challenges when scaling to deep reinforcement learning, often making sacrifices that bias the estimated uncertainty. In this paper, we propose a novel method for estimating epistemic uncertainty that relies on inducing a distribution over temporal difference errors. While uncertainty over value estimates can be biased by environment stochasticity and may not be temporally consistent, our measure controls for such irreducible stochasticity and isolates the component of uncertainty in value that is due to uncertainty over model parameters. Because our measure of uncertainty conditions on the future, we cannot act on it directly. We therefore incorporate uncertainty through learning and treat exploration as a separate learning problem, one induced by the agent's temporal difference uncertainties over the reward-maximising policy. Distinct exploration policies learn to collect data with high estimated uncertainty, which gives rise to a "curriculum" that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our proposed method on a set of hard-exploration tasks in both grid-world and Atari 2600 environments and find that this form of exploration facilitates both diverse and deep exploration and can adapt as learning dynamics change.
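To make the core idea concrete, the following is a minimal tabular sketch of one way to estimate temporal difference uncertainty, not the paper's implementation: an ensemble of Q-tables stands in for a distribution over model parameters, and the spread of TD errors across ensemble members serves as the epistemic signal. The function name, the ensemble shape, and the use of a greedy bootstrap are all illustrative assumptions.

```python
import numpy as np

def td_error_uncertainty(q_ensemble, s, a, r, s_next, gamma=0.99):
    """Variance of TD errors across an ensemble of Q-tables.

    q_ensemble: array of shape (K, num_states, num_actions), where the K
    members act as samples from a distribution over model parameters
    (an illustrative stand-in for the induced TD-error distribution).
    """
    q_sa = q_ensemble[:, s, a]                      # (K,) value estimate per member
    bootstrap = q_ensemble[:, s_next].max(axis=-1)  # (K,) greedy bootstrap per member
    td_errors = r + gamma * bootstrap - q_sa        # one TD error per member
    # The observed reward r is shared by all members, so its irreducible
    # randomness shifts every TD error equally and drops out of the
    # across-member variance; what remains reflects parameter uncertainty.
    return td_errors.var()
```

Such a signal could then serve as an intrinsic reward for a separate exploration policy: transitions where ensemble members disagree most about the TD error are exactly those where more data is expected to improve the value estimates, and the signal shrinks toward zero as the members converge.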

Authors' notes