Paper
Publication
BYOL-Explore: Exploration with Bootstrapped Prediction
Abstract

We present BYOL-Explore, a conceptually simple yet general approach for curiosity-driven exploration in visually-complex environments. BYOL-Explore learns a world representation, the world dynamics, and an exploration policy all-together by optimizing a single prediction loss in the latent space with no additional auxiliary objective. We show that BYOL-Explore is effective in DM-HARD-8, a challenging partially-observable continuous-action hard-exploration benchmark with visually-rich 3-D environments. On this benchmark, we solve the majority of the tasks purely through augmenting the extrinsic reward with BYOL-Explore's intrinsic reward, whereas prior work could only get off the ground with human demonstrations. As further evidence of the generality of BYOL-Explore, we show that it achieves superhuman performance on the ten hardest exploration games in Atari while having a much simpler design than other competitive agents.

Authors' notes
Second-person and top-down views of a BYOL-Explore agent solving Thow-Across level of DM-HARD-8, whereas pure RL and other baseline exploration methods fail to make any progress on Thow-Across.

Curiosity-driven exploration is the active process of seeking new information to enhance the agent’s understanding of its environment. Suppose that the agent has learned a model of the world that can predict future events given the history of past events. The curiosity-driven agent can then use the prediction mismatch of the world model as the intrinsic reward for directing its exploration policy towards seeking new information. As follows, the agent can then use this new information to enhance the world model itself so it can make better predictions.  This iterative process can allow the agent to eventually explore every novelty  in the world and use this information to build an accurate world model.

Inspired by the successes of bootstrap your own latent (BYOL) – which has been applied in computer vision, graph representation learning, and representation learning in RL – we propose BYOL-Explore: a conceptually simple yet general, curiosity-driven AI agent for solving hard-exploration tasks. BYOL-Explore learns a representation of the world by predicting its own future representation. Then, it uses the prediction-error at the representation level as an intrinsic reward to train a curiosity-driven policy. Therefore, BYOL-Explore learns a world representation, the world dynamics, and a curiosity-driven exploration policy all-together, simply by optimising the prediction error at the representation level.

Comparison between BYOL-Explore, Random Network Distillation (RND), Intrinsic Curiosity Module (ICM) and pure RL (no intrinsic reward), in terms of mean capped human-normalised score (CHNS).

Despite the simplicity of its design, when applied to the DM-HARD-8 suite of challenging 3-D, visually complex, and hard exploration tasks, BYOL-Explore outperforms standard curiosity-driven exploration methods such as Random Network Distillation (RND) and Intrinsic Curiosity Module (ICM), in terms of mean capped human-normalised score (CHNS), measured across all tasks. Remarkably, BYOL-Explore achieved this performance using only a single network concurrently trained across all tasks, whereas prior work was restricted to the single-task setting and could only make meaningful progress on these tasks when provided with human expert demonstrations. 

As further evidence of its generality, BYOL-Explore achieves super-human performance in the ten hardest exploration Atari games, while having a simpler design than other competitive agents, such as Agent57 and Go-Explore.

Comparison between BYOL-Explore, Random Network Distillation (RND), Intrinsic Curiosity Module (ICM) and pure RL (no intrinsic reward), in terms of mean capped human-normalised score (CHNS).

Moving forward, we can generalise BYOL-Explore to highly stochastic environments by learning a probabilistic world model that could be used to generate trajectories of the future events. This could allow the agent to model the possible stochasticity of the environment, avoid stochastic traps, and plan for exploration.