When should agents explore?

Exploration remains a central challenge for reinforcement learning (RL). One feature that virtually all existing methods share is a monolithic behaviour policy that changes only gradually (at best). In contrast, the exploratory behaviours of animals and humans exhibit a rich diversity, notably including forms of switching between modes. This paper presents an initial study of mode-switching, non-monolithic exploration for RL. We investigate which modes to switch between, at what timescales it makes sense to switch, and which signals make for good switching triggers. We also propose practical algorithmic tweaks that make the switching mechanism adaptive and robust, preventing the additional flexibility from turning into a hyper-parameter-tuning burden. Finally, we report promising initial results on Atari, using two-mode exploration and switching at sub-episodic timescales.
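The two-mode, sub-episodic switching idea can be sketched as follows. This is a minimal illustration under assumed details (class and parameter names, fixed step budgets as the switching trigger, uniform-random actions as the exploration mode), not the paper's exact algorithm:

```python
import random

class TwoModeActor:
    """Illustrative non-monolithic actor: alternates between an "exploit"
    mode (greedy w.r.t. value estimates) and an "explore" mode (uniform-
    random actions), switching at a sub-episodic timescale. Here the
    trigger is simply a fixed per-mode step budget; the paper studies
    richer, adaptive triggers."""

    def __init__(self, n_actions, exploit_steps=20, explore_steps=5, seed=0):
        self.n_actions = n_actions
        self.exploit_steps = exploit_steps  # length of a greedy stretch
        self.explore_steps = explore_steps  # length of an exploration burst
        self.mode = "exploit"
        self.steps_in_mode = 0
        self.rng = random.Random(seed)

    def _maybe_switch(self):
        # Sub-episodic trigger: flip modes once the step budget is spent.
        budget = (self.exploit_steps if self.mode == "exploit"
                  else self.explore_steps)
        if self.steps_in_mode >= budget:
            self.mode = "explore" if self.mode == "exploit" else "exploit"
            self.steps_in_mode = 0

    def act(self, q_values):
        self._maybe_switch()
        self.steps_in_mode += 1
        if self.mode == "exploit":
            # Greedy action under the current value estimates.
            return max(range(self.n_actions), key=lambda a: q_values[a])
        # Explore mode: act uniformly at random.
        return self.rng.randrange(self.n_actions)
```

The contrast with a monolithic policy such as epsilon-greedy is that exploration here comes in temporally extended bursts rather than as a per-step blend, so exploratory behaviour is a distinct mode rather than noise folded into a single policy.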

Authors' notes