Quantifying Differences in Reward Functions

For many tasks, the reward function is too complex to be specified procedurally and must instead be learned from user data. Prior work has evaluated such learned reward functions by examining rollouts from a policy optimized for the learned reward. However, this method has two problems. (1) It is not predictive of how robust policy training is to a change in transition dynamics. (2) It cannot distinguish between the learned reward function failing to reflect user preferences and the reinforcement learning algorithm failing to optimize the learned reward. To address these problems, we introduce the Equivalent-Policy Invariant Comparison (EPIC) pseudometric, which quantifies the difference between two reward functions directly, without the need to train a policy. We show that EPIC is invariant on an equivalence class of reward functions that are guaranteed to induce the same optimal policy. Furthermore, we compute EPIC distances between hand-designed reward functions on toy environments and a range of MuJoCo continuous control tasks, finding the results to be more reliable than those of baseline methods. Finally, we show that the EPIC distance to the ground-truth reward function is predictive of the success of training a policy, even under different transition dynamics. In this sense, our method is more informative than evaluation via policy training.
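To make the idea concrete, here is a minimal Monte Carlo sketch of an EPIC-style distance: each reward is first canonicalized (which removes any potential-based shaping term), and the canonicalized rewards are then compared by Pearson distance, which is invariant to positive rescaling and constant shifts. This is illustrative only; it assumes scalar states and actions sampled uniformly on [0, 1] and a fixed discount, whereas the paper defines EPIC for arbitrary coverage distributions. The function name and sampling choices below are ours, not from the paper.

```python
import numpy as np

def epic_distance(r1, r2, n=512, gamma=0.9, seed=0):
    """Sketch of an EPIC-style pseudometric between reward functions
    r(s, a, s_next) -> float (vectorized over NumPy arrays)."""
    rng = np.random.default_rng(seed)
    # Transitions on which the two canonicalized rewards are compared.
    s, a, sn = (rng.uniform(size=n) for _ in range(3))
    # Independent samples used to estimate the canonicalizing expectations.
    S, A, Sn = (rng.uniform(size=n) for _ in range(3))

    def canonical(r):
        # C(r)(s, a, s') = r(s, a, s')
        #   + E[gamma * r(s', A, S') - r(s, A, S') - gamma * r(S, A, S')],
        # which cancels any shaping term gamma * phi(s') - phi(s) in r.
        exp_from = lambda x: r(x[:, None], A[None, :], Sn[None, :]).mean(axis=1)
        exp_all = r(S[:, None], A[None, :], Sn[None, :]).mean()
        return r(s, a, sn) + gamma * exp_from(sn) - exp_from(s) - gamma * exp_all

    c1, c2 = canonical(r1), canonical(r2)
    # Pearson distance in [0, 1]: 0 for equivalent rewards, 1 for opposed ones.
    rho = np.corrcoef(c1, c2)[0, 1]
    return np.sqrt((1.0 - rho) / 2.0)
```

For example, a reward rescaled by a positive constant and shaped by a potential (which preserves the optimal policy) sits at distance near 0 from the original, while the negated reward sits at distance near 1.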