Alchemy: A structured task distribution for meta-reinforcement learning

Recent years have seen the development of powerful deep RL agents in various domains. However, these agents still lack important cognitive abilities, such as inferring latent states to support complex planning, exploring intelligently, and generalizing quickly to new tasks. To make progress in meta-reinforcement learning, we need good benchmarks that push agents to generalize over structured task distributions. Current benchmarks are either too narrow or lack enough latent structure to require complex reasoning or inference. We introduce a new procedurally generated test environment called Alchemy, which aims to fill these gaps. Along with releasing this environment, we also provide analysis tools and a detailed characterization of agent performance. We show that powerful deep RL agents fail to demonstrate an understanding of the task's latent structure. In the process, we also highlight the value of looking beyond episode reward, instead drawing on a cognitive science framework to gain insight into which cognitive abilities agents lack.

Authors' notes

When humans are faced with a new task, we are typically able to tackle it with admirable speed, requiring very little experience to get going. This kind of efficiency and flexibility is something we would also like to see in artificial agents. However, although there has recently been dramatic progress in building deep reinforcement learning (RL) agents that can perform complex tasks after extensive training, getting deep RL agents to rapidly master new tasks remains an open problem.

One promising approach is meta-learning, or learning to learn. The idea here is that the learner gains repurposable knowledge across a large set of experiences, and as this knowledge accumulates, it allows the learner to adapt more and more quickly to each new task it encounters. There has been rapidly growing interest in developing methods for meta-learning within deep RL. Although there has been substantive progress toward such ‘meta-reinforcement learning,’ research in this area has been held back by a shortage of benchmark tasks. In the present work, we aim to ease this problem by introducing (and open-sourcing) Alchemy, a useful new benchmark environment for meta-RL, along with a suite of analysis tools.

In order for meta-learning to occur, it is necessary that the environment present the learner not with a single task, but instead with a series or distribution of tasks, all of which have some high-level features in common. Although such interrelated task settings are common in the real world (think of board games, or kitchen tasks, or subway systems), they are notoriously difficult to design for artificial agents operating in simulated environments. Ideally, we would like task distributions that are both interesting and accessible: Interesting in the sense that they involve the rich kinds of shared structure that one sees in real-world tasks; and accessible in the sense that we have complete knowledge of the full task distribution, allowing us to say precisely what the shared structure is that a good meta-learner would pick up on. Previous work on meta-RL has generally relied on task distributions that are either accessible without being interesting (such as bandit tasks), or else interesting without being accessible (such as Atari games). Alchemy is designed to offer the best of both worlds.

Alchemy is a single-player video game, implemented in Unity. The player sees a first-person view of a table with a number of objects on it, including a set of colored stones, a set of dishes containing colored potions, and a central cauldron. Stones have different point values, and points are collected when stones are added to the cauldron. By dipping stones into the potions, the player can transform the stones’ appearance, and thus their value, increasing the number of points that can be won.

However, Alchemy also involves a crucially important catch: The ‘chemistry’ that governs how potions affect stones changes every time the game is played. A skillful player must perform a set of targeted experiments to discover how the current chemistry works, and use the results of those experiments to guide strategic action sequences. Learning to do that, over the course of many rounds of Alchemy, is precisely the meta-RL challenge.
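The round structure described above can be illustrated with a toy sketch. All names here are hypothetical and this is not the released environment's API: it simply shows the key idea that a hidden "chemistry" (sampled fresh each round) maps a stone's latent state through potion dips, and the resulting state determines the points collected at the cauldron.

```python
import random

# Toy sketch of an Alchemy-like round (hypothetical names; not the
# released environment's API). A hidden chemistry maps
# (stone state, potion) -> new stone state; the final state
# determines the stone's point value when added to the cauldron.

STATES = [0, 1, 2, 3]                 # latent stone states
VALUES = {0: -1, 1: 1, 2: 5, 3: 15}   # points per state at the cauldron

def sample_chemistry(rng):
    """Sample a hidden chemistry: each potion permutes the latent states."""
    chemistry = {}
    for potion in ("red", "green", "blue"):
        perm = STATES[:]
        rng.shuffle(perm)
        chemistry[potion] = dict(zip(STATES, perm))
    return chemistry

def play_round(chemistry, stone_state, actions):
    """Apply a sequence of potion dips, then cash the stone in."""
    for potion in actions:
        stone_state = chemistry[potion][stone_state]
    return VALUES[stone_state]

rng = random.Random(0)
chem = sample_chemistry(rng)          # resampled every round
score = play_round(chem, stone_state=0, actions=["red", "blue"])
```

Because the chemistry is resampled each round, a high score on one round tells the player nothing directly about the next; only the higher-level structure (that potions permute states consistently within a round) transfers, which is exactly what a meta-learner must exploit.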

Alchemy has an ‘interesting’ structure, in the sense that it involves a compositional set of latent causal relationships, and requires strategic experimentation and action sequencing. But Alchemy’s structure is also ‘accessible,’ since game levels are created based on an explicit generative process.

This accessibility allows us to identify optimal meta-learning performance in Alchemy, by building a Bayes-optimal solver with access to the generative process. This optimal agent offers a valuable gold standard against which to compare any deep RL agent.
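The core idea behind such a solver can be sketched in miniature. The following is an illustrative toy only, not the released Bayes-optimal agent: it maintains a posterior over a small hypothesis space of chemistries and rules out every hypothesis inconsistent with an observed potion-dip transition.

```python
import itertools

# Toy Bayesian inference over chemistries (illustrative only; the
# released solver works with Alchemy's actual generative process).
# Each hypothesis assigns a permutation of latent stone states to
# each potion; an observed transition eliminates inconsistent ones.

STATES = (0, 1)
POTIONS = ("red", "blue")

def all_chemistries():
    """Enumerate every assignment of a state permutation to each potion."""
    perms = list(itertools.permutations(STATES))
    for combo in itertools.product(perms, repeat=len(POTIONS)):
        yield {p: dict(zip(STATES, perm)) for p, perm in zip(POTIONS, combo)}

def update(hypotheses, posterior, potion, state, next_state):
    """Bayes update: zero out hypotheses that contradict the observation."""
    weights = [pr if chem[potion][state] == next_state else 0.0
               for chem, pr in zip(hypotheses, posterior)]
    total = sum(weights)
    return [w / total for w in weights]

hypotheses = list(all_chemistries())
posterior = [1.0 / len(hypotheses)] * len(hypotheses)

# One experiment: dipping a state-0 stone in red turned it into state 1.
posterior = update(hypotheses, posterior, "red", 0, 1)
```

In the real environment the hypothesis space is far larger, so the released solver relies on the known generative process rather than brute-force enumeration; but the principle, that each targeted experiment prunes the space of possible chemistries, is the same.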

As a first application of Alchemy, we presented it to two powerful deep RL agents (IMPALA and V-MPO). As detailed in our paper, although these agents have been shown to do well in many single-task RL environments, in Alchemy both of them displayed very poor meta-learning performance. Even after extensive training, both agents showed behavior reflecting only a superficial ‘understanding’ of the task -- essentially dipping stones into potions randomly, until a high stone value happened to result. Through a series of detailed analyses, we were able to establish that this failure of meta-learning was due not simply to the visuo-motor challenges of the 3D environment, nor to the difficulty of sequencing actions to achieve goals. Instead, the agents’ poor performance specifically reflected a failure of structure learning and latent-state inference, the core functions involved in meta-learning. Overall, the initial experiments presented in our report suggest that Alchemy may be a useful benchmark task for meta-RL research.

In tandem with our paper, we are releasing Alchemy as a public resource. The release includes multiple versions of the game (including a simplified, symbolic version, and a human-playable version), along with the Bayes-optimal benchmark agent described above and numerous other resources and analysis tools.

By Jane X. Wang, Michael King, Nicolas Porcel, Zeb Kurth-Nelson, Tina Zhu, Charlie Deck, Peter Choy, Mary Cassin, Malcolm Reynolds, Francis Song, Gavin Buttimore, David P. Reichert, Neil Rabinowitz, Loic Matthey, Demis Hassabis, Alex Lerchner, and Matthew Botvinick.