Recent years have seen the development of powerful deep RL agents across a variety of domains. However, these agents still lack a number of cognitive abilities, such as inferring latent state to support complex planning, exploring efficiently, and generalizing quickly to new tasks. To make progress in meta-reinforcement learning, we need good benchmarks that push agents to generalize over structured task distributions. Current benchmarks are either too narrow or lack sufficient latent structure to require complex reasoning or inference. We introduce Alchemy, a new procedurally generated testing environment that aims to fill these gaps. Along with the environment, we release analysis tools and a detailed characterization of agent performance. We show that powerful deep RL agents fail to demonstrate an understanding of the task's latent structure. In the process, we also highlight the value of looking beyond episode reward, drawing instead on a more cognitive-science-oriented framework to gain insight into which cognitive abilities agents are lacking.