A common vision from science fiction is that artificial intelligence (AI) will one day sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we use the simplification of a virtual environment to study how to design artificial agents that can interact naturally with humans. This setting nevertheless integrates a number of the central challenges of AI research: complex visual perception and goal-directed physical control, grounded language comprehension and production, and multi-agent social interaction. To build agents that can robustly interact with humans, we would ideally train them while they interact with humans. However, this is presently impractical. Therefore, we approximate the role of the human with another learned agent, and use ideas from inverse reinforcement learning to reduce the disparities between human-human and agent-agent interactive behaviour.
Rigorously evaluating our agents poses a great challenge, so we develop a variety of behavioural tests, including evaluation by humans who watch videos of agents or interact directly with them. These evaluations convincingly demonstrate that interactive training and auxiliary losses improve agent behaviour beyond what is achieved by supervised learning of actions alone. Further, we demonstrate that agent capabilities generalise beyond literal experiences in the dataset. Finally, we train evaluation models whose ratings of agents agree well with human judgement, thus permitting the evaluation of new agent models without additional effort. Taken together, our results in this virtual environment provide evidence that large-scale human behavioural imitation is a promising tool to create intelligent, interactive agents, and that the challenge of reliably evaluating such agents can be surmounted. See videos for an overview of the manuscript, training time-lapse, and human-agent interactions.
Two questions must be answered at the outset of any artificial intelligence research. What do we want AI systems to do? And how will we evaluate when we are making progress toward this goal? Alan Turing, in his seminal paper describing the Turing Test, which he more modestly named the imitation game, argued that for a certain kind of AI, these questions may be one and the same. Roughly, if an AI’s behaviour resembles human-like intelligence when a person interacts with it, then the AI has passed the test and can be called intelligent. An AI that is designed to interact with humans should be tested via interaction with humans.
At the same time, interaction is not just a test of intelligence but also the point. For AI agents to be generally helpful, they should assist us in diverse activities and communicate with us naturally. In science fiction, the vision of robots that we can speak to is commonplace. And intelligent digital agents that can help accomplish large numbers of tasks would be eminently useful. To bring these devices into reality, we therefore must study the problem of how to create agents that can capably interact with humans and produce actions in a rich world.
Building agents that can interact with humans and the world poses a number of important challenges. How can we provide appropriate learning signals to teach artificial agents such abilities? How can we evaluate the performance of the agents we develop, when language itself is ambiguous and abstract? Just as the wind tunnel serves the design of the airplane, we have created a virtual environment to serve research on how to make interacting agents.
We first create a simulated environment, the Playroom, in which virtual robots can engage in a variety of interesting interactions by moving around, manipulating objects, and speaking to each other. The Playroom’s dimensions can be randomised, as can its allocation of shelves, furniture, landmarks like windows and doors, and an assortment of children’s toys and domestic objects. The diversity of the environment enables interactions involving reasoning about space and object relations, ambiguity of references, containment, construction, support, occlusion, and partial observability. We embedded two agents in the Playroom to provide a social dimension for studying joint intentionality, cooperation, communication of private knowledge, and so on.
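To make the idea of a randomised room concrete, the following is a minimal sketch of sampling a room layout. It is not the Playroom's actual implementation: the class `PlayroomConfig`, the function `sample_playroom`, the item lists, and all numeric ranges are hypothetical and chosen only for illustration.

```python
import random
from dataclasses import dataclass
from typing import List

# All names, lists, and ranges below are illustrative assumptions,
# not values taken from the Playroom itself.
FURNITURE = ["shelf", "bed", "table", "chair"]
LANDMARKS = ["window", "door"]
OBJECTS = ["ball", "teddy bear", "book", "mug", "toy train"]


@dataclass
class PlayroomConfig:
    width_m: float
    depth_m: float
    furniture: List[str]
    landmarks: List[str]
    objects: List[str]


def sample_playroom(rng: random.Random) -> PlayroomConfig:
    """Draw one random room layout from the hypothetical ranges above."""
    return PlayroomConfig(
        width_m=rng.uniform(4.0, 8.0),
        depth_m=rng.uniform(4.0, 8.0),
        # Furniture and landmarks are drawn without repetition.
        furniture=rng.sample(FURNITURE, k=rng.randint(1, len(FURNITURE))),
        landmarks=rng.sample(LANDMARKS, k=rng.randint(1, len(LANDMARKS))),
        # Objects may repeat, e.g. two balls, which makes references ambiguous.
        objects=rng.choices(OBJECTS, k=rng.randint(3, 8)),
    )


room = sample_playroom(random.Random(0))
print(room)
```

Allowing repeated objects is a deliberate choice in this sketch: a room with two balls makes an instruction such as "lift the ball" ambiguous, which is the kind of referential ambiguity the environment is meant to expose.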
We harness a range of learning paradigms to build agents that can interact with humans, including imitation learning, reinforcement learning, and supervised and unsupervised learning. As Turing may have anticipated in naming “the imitation game,” perhaps the most direct route to create agents that can interact with humans is through imitation of human behaviour. Large datasets of human behaviour along with algorithms for imitation learning from those data have been instrumental for making agents that can interact with textual language or play games. For grounded language interactions, we have no readily available, pre-existing data source of behaviour, so we created a system for eliciting interactions between pairs of human participants. These interactions were elicited primarily by prompting one of the players with a cue to improvise an instruction, e.g., “Ask the other player to position something relative to something else.” Some of the interaction prompts involve questions as well as instructions, like “Ask the other player to describe where something is.” In total, we collected more than a year of real-time human interactions in this setting.
Imitation learning, reinforcement learning, and auxiliary learning (consisting of supervised and unsupervised representation learning) are integrated into a form of interactive self-play that is crucial for creating our best agents. Such agents can follow commands and answer questions. We call these agents “solvers.” But our agents can also provide commands and ask questions. We call these agents “setters.” Setters interactively pose problems to solvers to produce better solvers. Once the agents are trained, however, humans can play as setters and interact with solver agents.
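The division of roles between setters and solvers can be illustrated with a toy interaction loop. This sketch shows only the interface, with no learning; `toy_setter`, `toy_solver`, `run_episodes`, the templates, and the object names are all hypothetical placeholders rather than our agents.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical templates and object names, for illustration only.
TEMPLATES = ["Put the {a} near the {b}.", "Where is the {a}?"]
OBJECTS = ["ball", "teddy bear", "book", "bed"]


def toy_setter(rng: random.Random) -> str:
    """Pose a problem: either an instruction or a question."""
    a, b = rng.sample(OBJECTS, 2)
    # str.format ignores keyword arguments a template does not use.
    return rng.choice(TEMPLATES).format(a=a, b=b)


def toy_solver(instruction: str) -> str:
    """Placeholder policy: speak in reply to questions, otherwise act."""
    if instruction.endswith("?"):
        return "say: I am not sure."
    return "act: attempt '" + instruction + "'"


def run_episodes(
    setter: Callable[[random.Random], str],
    solver: Callable[[str], str],
    n: int,
    seed: int = 0,
) -> List[Tuple[str, str]]:
    """Let the setter pose n problems and record the solver's responses."""
    rng = random.Random(seed)
    episodes = []
    for _ in range(n):
        instruction = setter(rng)
        episodes.append((instruction, solver(instruction)))
    return episodes


for instruction, response in run_episodes(toy_setter, toy_solver, n=3):
    print(instruction, "->", response)
```

Because `run_episodes` accepts any callable as the setter, a function that reads typed input from a person could stand in for `toy_setter`, mirroring the way humans can take the setter role once solvers are trained.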
Our interactions cannot be evaluated in the same way that most simple reinforcement learning problems can. There is no notion of winning or losing, for example. Indeed, communicating with language while sharing a physical environment introduces a surprising number of abstract and ambiguous notions. For example, if a setter asks a solver to put something near something else, what exactly is “near”? But accurate evaluation of trained models in standardised settings is a linchpin of modern machine learning and artificial intelligence. To cope with this setting, we have developed a variety of evaluation methods to help diagnose problems in agents and to score them, including simply having humans interact with agents in large trials.
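The ambiguity of “near” can be made concrete with a small sketch. Any scripted success check has to commit to a distance threshold, and equally reasonable thresholds can disagree about the same outcome. The function `is_near`, the positions, and the thresholds below are hypothetical and are not part of our evaluation methods.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def is_near(a: Point, b: Point, threshold_m: float) -> bool:
    """Hypothetical scripted check: 'near' means within threshold_m metres."""
    return math.dist(a, b) <= threshold_m


# Assumed final positions after a solver responds to "put the ball near the bed".
ball: Point = (0.0, 0.0)
bed: Point = (1.0, 0.5)  # about 1.12 m from the ball

# Two plausible thresholds give opposite verdicts on the same episode.
strict = is_near(ball, bed, threshold_m=1.0)
lenient = is_near(ball, bed, threshold_m=1.5)
print(strict, lenient)
```

The same episode counts as a failure under the stricter threshold and a success under the more lenient one, which is one reason human judgement remains valuable for scoring such interactions.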
A distinct advantage of our setting is that human operators can set a virtually unlimited range of new tasks via language, and quickly understand the competencies of our agents. There are many tasks that our agents cannot cope with, but our approach to building AIs offers a clear path for improvement across a growing set of competencies. Our methods are general and can be applied wherever we need agents that interact with complex environments and people.