TL;DR: We present a method for training reinforcement learning agents from human feedback in the presence of unknown unsafe states.
When we train reinforcement learning (RL) agents in the real world, we don’t want them to explore unsafe states, such as driving a mobile robot into a ditch or writing an embarrassing email to one’s boss. Training RL agents in the presence of unsafe states is known as the safe exploration problem. We tackle the hardest version of this problem, in which the agent initially doesn’t know how the environment works or where the unsafe states are. The agent has one source of information: feedback about unsafe states from a human user.
Existing methods for training agents from human feedback ask the user to evaluate data of the agent acting in the environment. That is, to learn about unsafe states, the agent first needs to visit them so that the user can provide feedback on them. This makes prior work inapplicable to tasks that require safe exploration.
In our latest paper, we propose a method for reward modeling that operates in two phases. First, the system is encouraged to explore a wide range of states through synthetically generated, hypothetical behaviour. The user provides feedback on this hypothetical behaviour, and the system interactively learns a model of the user's reward function. Only after the model has successfully learned to predict rewards and unsafe states do we deploy an RL agent that safely performs the desired task.
We start with a generative model of initial states and a forward dynamics model, trained on off-policy data like random trajectories or safe expert demonstrations. Our method uses these models to synthesise hypothetical behaviours, asks the user to label the behaviours with rewards, and trains a neural network to predict these rewards. The key idea is to actively synthesise the hypothetical behaviours from scratch to make them as informative as possible, without interacting with the environment. We call this method reward query synthesis via trajectory optimisation (ReQueST).
Synthesising informative hypotheticals using trajectory optimisation
For this approach to work, we need the system to simulate and explore a wide range of behaviours, in order to effectively train the reward model. To encourage exploration during reward model training, ReQueST synthesises four different types of hypothetical behaviours using gradient descent trajectory optimisation. The first type of hypothetical behaviour maximises the uncertainty of an ensemble of reward models, eliciting user labels for behaviours that have the highest information value. The second type of hypothetical behaviour maximises predicted rewards, surfacing behaviours for which the reward model might be incorrectly predicting high rewards; i.e., reward hacking. The third type of hypothetical behaviour minimises predicted rewards, adding potentially unsafe hypothetical behaviours to the training data. This data enables the reward model to learn about unsafe states. The fourth type of hypothetical behaviour maximises the novelty of trajectories, encouraging exploration of a wide range of states, regardless of predicted rewards.
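The four query types above can be sketched on a toy problem. The following is a minimal illustration, not the released implementation: the 1D dynamics model, the three-member logistic ensemble, the `SEEN_STATES` list, and every function name are assumptions made for this example. It also swaps in finite-difference gradients for backpropagation through learned models, and uses a simple penalty on large actions as a stand-in for a regulariser that keeps trajectories plausible.

```python
import math

# Hypothetical ensemble of reward models: (weight, bias) of 1D logistic models.
ENSEMBLE = [(1.0, 0.0), (1.5, -0.5), (0.5, 0.5)]
# Hypothetical states the user has already labelled.
SEEN_STATES = [-1.0, 0.0]

def rollout(s0, actions):
    """Toy dynamics model (an assumption): the state moves by the action."""
    states = [s0]
    for a in actions:
        states.append(states[-1] + a)
    return states

def predict(member, s):
    w, b = member
    return 1.0 / (1.0 + math.exp(-(w * s + b)))

def mean_reward(states):
    """Ensemble-mean predicted reward, summed along the trajectory."""
    return sum(sum(predict(m, s) for m in ENSEMBLE) / len(ENSEMBLE)
               for s in states)

def uncertainty(states):
    """Ensemble disagreement (variance of predictions), summed over states."""
    total = 0.0
    for s in states:
        preds = [predict(m, s) for m in ENSEMBLE]
        mu = sum(preds) / len(preds)
        total += sum((p - mu) ** 2 for p in preds) / len(preds)
    return total

def novelty(states):
    """Distance from each state to the nearest already-labelled state."""
    return sum(min(abs(s - t) for t in SEEN_STATES) for s in states)

ACQUISITION = {
    "uncertainty": uncertainty,
    "max_reward": mean_reward,
    "min_reward": lambda states: -mean_reward(states),
    "novelty": novelty,
}

def synthesise(kind, s0=0.0, horizon=5, steps=200, lr=0.05, reg=0.1, eps=1e-4):
    """Gradient ascent on the action sequence for one acquisition function."""
    objective = ACQUISITION[kind]

    def score(actions):
        # The action penalty is a crude stand-in for a plausibility regulariser.
        return objective(rollout(s0, actions)) - reg * sum(a * a for a in actions)

    actions = [0.01] * horizon
    for _ in range(steps):
        grads = []
        for i in range(horizon):
            up, down = actions[:], actions[:]
            up[i] += eps
            down[i] -= eps
            # Central finite difference in place of backpropagation.
            grads.append((score(up) - score(down)) / (2 * eps))
        actions = [a + lr * g for a, g in zip(actions, grads)]
    return rollout(s0, actions)
```

Calling `synthesise("max_reward")` and `synthesise("min_reward")` from the same initial state yields trajectories that drift in opposite directions, which is the point of keeping separate query types: each one probes a different region of the reward model.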
Training the reward model using supervised learning
Each hypothetical behaviour consists of a sequence of state transitions (s, a, s’). We ask the user to label each state transition with a reward, r. Then, given the labeled dataset of transitions (s, a, r, s’), we train a neural network to predict rewards using a maximum-likelihood objective. We use standard supervised learning techniques based on gradient descent.
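To make the maximum-likelihood step concrete, here is a small sketch under simplifying assumptions: labels are binary (1 for a desirable transition, 0 otherwise), the model is a logistic regression rather than a neural network, and the `DATA` set is produced by a made-up rule standing in for the user. The function names are illustrative only.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def features(s, a, s_next):
    """Feature vector for one transition, with a constant bias term."""
    return [s, a, s_next, 1.0]

def train_reward_model(transitions, epochs=2000, lr=0.5):
    """Fit weights by gradient descent on the mean negative log-likelihood."""
    weights = [0.0] * 4
    for _ in range(epochs):
        grad = [0.0] * 4
        for s, a, r, s_next in transitions:
            x = features(s, a, s_next)
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
            # Gradient of the Bernoulli negative log-likelihood is (p - r) * x.
            for i, xi in enumerate(x):
                grad[i] += (p - r) * xi
        weights = [w - lr * g / len(transitions) for w, g in zip(weights, grad)]
    return weights

def predict_reward(weights, s, a, s_next):
    """Predicted probability that the transition would be labelled 1."""
    x = features(s, a, s_next)
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

# Hypothetical labelled transitions (s, a, r, s'): the simulated user gives
# reward 1 when the next state is beyond 0.5, and 0 otherwise.
DATA = [(k / 10, 0.1, 1.0 if k / 10 + 0.1 > 0.5 else 0.0, k / 10 + 0.1)
        for k in range(10)]

WEIGHTS = train_reward_model(DATA)
```

After training, `predict_reward(WEIGHTS, 0.8, 0.1, 0.9)` should be well above 0.5 and `predict_reward(WEIGHTS, 0.0, 0.1, 0.1)` well below it, mirroring the labelling rule.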
Deploying a model-based RL agent
Once the user is satisfied with the reward model, we deploy a planning-based agent that uses model-predictive control (MPC) to pick actions that optimise the learned rewards. Unlike model-free RL algorithms like Q-learning or policy gradient methods that learn through trial and error, model-based RL algorithms like MPC enable the agent to avoid unsafe states during deployment by using the dynamics model to anticipate the consequences of its actions.
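The idea of anticipating consequences with a dynamics model can be illustrated with random-shooting MPC, one simple planning scheme among several. This sketch is an assumption-laden toy: the trap location, the goal, the hand-written `reward_model` (standing in for a learned one), and the additive `dynamics_model` are all invented for the example, and the true environment is taken to match the dynamics model exactly.

```python
import math
import random

GOAL = (1.0, 1.0)

def in_trap(state):
    """Hypothetical unsafe region in the middle of the map."""
    x, y = state
    return 0.4 <= x <= 0.6 and 0.4 <= y <= 0.6

def dynamics_model(state, action):
    """Toy dynamics (an assumption): the action is added to the position."""
    return (state[0] + action[0], state[1] + action[1])

def reward_model(state):
    """Stand-in for a learned reward model: penalise the trap, reward
    proximity to the goal."""
    if in_trap(state):
        return -100.0
    return -math.hypot(state[0] - GOAL[0], state[1] - GOAL[1])

def plan(state, rng, horizon=5, n_samples=200, max_step=0.1):
    """Sample action sequences, score them under the models, and return
    the first action of the best sequence."""
    best_return, best_action = -math.inf, (0.0, 0.0)
    for _ in range(n_samples):
        actions = [(rng.uniform(-max_step, max_step),
                    rng.uniform(-max_step, max_step))
                   for _ in range(horizon)]
        s, total = state, 0.0
        for a in actions:
            s = dynamics_model(s, a)
            total += reward_model(s)
        if total > best_return:
            best_return, best_action = total, actions[0]
    return best_action

def run_mpc(start=(0.0, 0.0), steps=30, seed=0):
    """Replan at every step and execute only the first planned action."""
    rng = random.Random(seed)
    state, path = start, [start]
    for _ in range(steps):
        # Here the environment is assumed to match the dynamics model.
        state = dynamics_model(state, plan(state, rng))
        path.append(state)
    return path
```

Because every candidate plan is scored before anything is executed, a sequence that passes through the trap is rejected in imagination rather than discovered by falling in. A model-free learner would have to experience the penalty to learn from it.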
We evaluate ReQueST with simulated users on a state-based 2D navigation task and the image-based Car Racing video game. Our results show that ReQueST satisfies three important safety properties: it can train a reward model to detect unsafe states without visiting them; it can correct reward hacking before deploying the agent; and it tends to learn robust reward models that perform well when transferred to new environments.
Testing generalisation in a toy 2D navigation task
To test the generalisation of the reward model, we set up a 2D navigation task with separate training and test environments.
We intentionally introduce a significant shift in the initial state distribution: the agent starts at the lower left corner (0, 0) in the training environment, and at the upper right corner (1, 1) in the test environment. Prior methods that collect data by deploying an agent in the training environment are unlikely to learn about the trap in the upper right corner, because they immediately find the goal, then fail to continue exploring. ReQueST synthesises a variety of hypothetical states, including states in and around the trap. The user labels these states with rewards, and from these labels ReQueST learns a robust reward model that enables the agent to navigate around the trap in the test environment.
Testing scalability in image-based Car Racing
To test whether ReQueST scales to domains with high-dimensional, continuous states like images, we use the Car Racing video game from the OpenAI Gym.
In addition to benchmarking ReQueST against prior methods, we ran a hyperparameter sweep and an ablation study to measure ReQueST’s sensitivity to its settings. We varied the regularisation strength of the dynamics model during trajectory optimisation, as well as the subset of hypotheticals synthesised. We found that ReQueST can trade off between producing realistic vs. informative queries, and that the optimal trade-off varies across domains. We also found that the usefulness of each of the four hypothetical behaviours depends on the domain and the amount of training data collected.
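One way to read the realistic-versus-informative trade-off is as a regularised objective. The notation below is introduced here for illustration and is an assumption about the form, not a quotation from the paper: J is the acquisition function for a given query type, p is the likelihood of a trajectory under the generative and dynamics models, and λ is the regularisation strength.

```latex
\tau_{\text{query}} = \arg\max_{\tau} \; J(\tau) + \lambda \log p(\tau)
```

Under this reading, a small λ lets the optimiser chase informative but possibly implausible trajectories, while a large λ keeps queries close to what the dynamics model considers realistic.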
To our knowledge, ReQueST is the first reward modeling algorithm that safely learns about unsafe states and scales to training neural network reward models in environments with high-dimensional, continuous states.
ReQueST relies on a generative model of initial states and a forward dynamics model, which can be hard to acquire for visual domains with complex dynamics. So far, we have only demonstrated the effectiveness of ReQueST in simulated domains with relatively simple dynamics. One direction for future work is to test ReQueST in 3D domains with more realistic physics and other agents acting in the environment.
- If you want to learn more, check out our preprint on arXiv: Siddharth Reddy, Anca D. Dragan, Sergey Levine, Shane Legg, Jan Leike, Learning Human Objectives by Evaluating Hypothetical Behavior, arXiv, 2019.
- To encourage replication and extensions, we have released our code.
- Listen to our podcast to learn more about DeepMind's commitment to building safe AI.
Thanks to Zac Kenton and Kelly Clancy for feedback on early drafts of this post, and to Paulo Estriga for his design work.