When agents are trained with offline reinforcement learning (ORL), off-policy policy evaluation (OPE) can be used to select the best agent. However, OPE is challenging and its estimates are not always accurate. In many applications it is realistic to assume that interactions with the real environment are too expensive to train a policy on, but evaluating a few selected policies is still feasible. Given such an opportunity to interact with the environment, we can hope to obtain better estimates while keeping the interaction budget small. This problem setting is relevant, for example, in robotics and in language applications. We refer to this problem as active offline policy selection (A-OPS). To use the limited interactions wisely, we employ a Bayesian optimisation approach in which we start from the OPE estimates and model the dependency between policies through the actions that they take. We test this approach on several environments with a diverse set of ORL policies.
Reinforcement learning (RL) has made tremendous progress in recent years towards addressing real-life problems, and offline RL has made it even more practical: instead of interacting directly with the environment, we can now train many algorithms from a single pre-recorded dataset. However, the data-efficiency advantage of offline RL is lost when we come to evaluate the trained policies, because evaluation typically requires interacting with the environment again.
For example, when training robotic manipulators, robot resources are usually limited, so training many policies with offline RL from a single dataset gives us a large data-efficiency advantage over online RL. But evaluating each policy is an expensive process that requires interacting with the robot thousands of times. When we also need to choose among algorithms, hyperparameters, and numbers of training steps, the problem quickly becomes intractable.
To make RL more applicable to real-world domains like robotics, we propose an intelligent evaluation procedure for selecting the policy to deploy, called active offline policy selection (A-OPS). In A-OPS, we make use of the prerecorded dataset and allow limited interactions with the real environment to boost the quality of the selection.
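As a rough, hypothetical illustration of this setting (not the released implementation), the sketch below shows the ingredients A-OPS works with: candidate policies trained offline, their OPE scores, and a small budget of real episodes. It uses a deliberately naive strategy that simply re-checks the top OPE candidates with real rollouts; the Gaussian-process-based procedure sketched after the feature list below is designed to spend the same budget more efficiently. All names and noise levels here are illustrative stand-ins.

```python
# Toy illustration of the A-OPS setting only; not the open-sourced A-OPS code.
import numpy as np

rng = np.random.default_rng(0)

n_policies = 20
true_returns = rng.uniform(0.0, 1.0, size=n_policies)                  # unknown in practice
ope_estimates = true_returns + rng.normal(scale=0.3, size=n_policies)  # noisy FQE-like scores

def run_episode(policy_id: int) -> float:
    """Stand-in for one (expensive) rollout of the policy in the real environment."""
    return float(true_returns[policy_id] + rng.normal(scale=0.1))

def select_policy(ope_estimates, run_episode, budget: int = 10, top_k: int = 5) -> int:
    """Naive baseline: spend the whole budget re-evaluating the top-k policies by OPE score."""
    shortlist = np.argsort(ope_estimates)[-top_k:]
    observed = {int(i): [] for i in shortlist}
    for step in range(budget):
        i = int(shortlist[step % top_k])        # round-robin over the shortlist
        observed[i].append(run_episode(i))      # spend one real interaction
    return max(observed, key=lambda i: np.mean(observed[i]))

print("policy chosen for deployment:", select_policy(ope_estimates, run_episode))
```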
To minimise interactions with the real environment, we implement three key features:
1. The returns of the policies are modelled jointly using a Gaussian process, where observations include fitted Q evaluation (FQE) scores and a small number of newly collected episodic returns from the robot.
2. After evaluating one policy, we gain knowledge about all policies, because their return distributions are correlated through a kernel between pairs of policies.
3. The kernel assumes that if policies take similar actions (such as moving the robotic gripper in a similar direction), they tend to have similar returns.
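The following is a minimal, self-contained numpy sketch of these three ideas; it is not the open-sourced A-OPS implementation, and every quantity in it (the probe states used to compare actions, the noise levels, the toy returns, the UCB-style acquisition rule) is a hypothetical stand-in chosen only to make the example run.

```python
# Minimal sketch: GP over policy returns with an action-based kernel, initialised from
# FQE scores and refined with a small budget of real episodes. Not the released code.
import numpy as np

rng = np.random.default_rng(0)
n_policies, n_states, action_dim = 10, 50, 4

true_returns = np.linspace(0.0, 1.0, n_policies)                      # unknown in practice
fqe_scores = true_returns + rng.normal(scale=0.5, size=n_policies)    # noisy OPE estimates
# Actions each candidate policy proposes on a shared set of probe states (hypothetical data).
policy_actions = rng.normal(size=(n_policies, n_states, action_dim))
policy_actions += true_returns[:, None, None]         # make similar policies act similarly

def action_kernel(actions, length_scale=2.0):
    """RBF kernel on mean squared action distance: similar actions -> similar returns."""
    diffs = actions[:, None] - actions[None, :]        # (P, P, S, A) pairwise differences
    d2 = (diffs ** 2).sum(-1).mean(-1)                 # mean squared distance per policy pair
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(K, obs_idx, obs_val, obs_var):
    """GP posterior mean/std of every policy's return given noisy observations."""
    idx = np.asarray(obs_idx)
    K_oo = K[np.ix_(idx, idx)] + np.diag(obs_var)      # covariance between observations
    K_po = K[:, idx]                                   # covariance policies x observations
    mean = K_po @ np.linalg.solve(K_oo, np.asarray(obs_val))
    cov = K - K_po @ np.linalg.solve(K_oo, K_po.T)
    return mean, np.sqrt(np.clip(np.diag(cov), 1e-9, None))

def run_episode(policy_id):
    """Stand-in for one real rollout of the policy (e.g. on the robot)."""
    return true_returns[policy_id] + rng.normal(scale=0.3)

K = action_kernel(policy_actions)
# Observations start from the FQE scores; real episodic returns are appended as we go.
obs_idx = list(range(n_policies))
obs_val = list(fqe_scores)
obs_var = [0.5 ** 2] * n_policies                      # assumed FQE noise variance

for _ in range(20):                                    # small budget of real episodes
    mean, std = gp_posterior(K, obs_idx, obs_val, obs_var)
    candidate = int(np.argmax(mean + 2.0 * std))       # UCB-style acquisition
    obs_idx.append(candidate)
    obs_val.append(run_episode(candidate))
    obs_var.append(0.3 ** 2)                           # episodic-return noise variance

mean, _ = gp_posterior(K, obs_idx, obs_val, obs_var)
print("policy selected for deployment:", int(np.argmax(mean)))
```

Note that each real episode updates the posterior over every policy, not only the one that was executed, because the action-based kernel couples their predicted returns.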
We demonstrated this procedure on a number of environments across several domains: dm-control, Atari, and simulated and real robotics. A-OPS rapidly reduces the regret (the gap between the return of the best policy in the set and the return of the policy we select), and with a moderate number of policy evaluations we identify the best policy.
Our results suggest that it is possible to perform effective offline policy selection with only a small number of environment interactions by combining the offline data, the action-based kernel, and Bayesian optimisation. The code for A-OPS is open-sourced and available on GitHub, with an example dataset to try.