Counterfactual Credit Assignment

Credit assignment in reinforcement learning is the problem of measuring an action’s influence on future rewards. It is difficult because rewards are also affected by other random events and choices that occur after the action. In this paper we attempt to separate skill from luck: we seek to disentangle the effect an action had on the return from the effects of external factors and subsequent actions. To achieve this, we borrow the concept of counterfactuals from causality theory and adapt it to a model-free reinforcement learning setup. The key idea is to condition the value function on future events. We introduce future-conditional baselines and critics, which extract hindsight information from a trajectory, and show under which conditions such value functions are valid. We discuss connections between causality theory and RL, and develop a practical algorithm that minimizes bias in the policy gradient estimates by constraining the hindsight information not to contain information about the agent’s action. We demonstrate the efficacy and validity of our algorithm on a number of illustrative problems.
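
As a rough sketch of the validity condition mentioned above, in notation that is our assumption rather than the abstract's own: let $\pi_\theta$ be the policy, $S_t$ and $A_t$ the state and action at time $t$, and $\Phi_t$ a hindsight statistic computed from the part of the trajectory after time $t$. If $\Phi_t$ is conditionally independent of $A_t$ given $S_t$, a future-conditional baseline $b(S_t, \Phi_t)$ contributes nothing in expectation to the policy gradient, since the score function has zero mean under the policy:

\[
\mathbb{E}\big[\nabla_\theta \log \pi_\theta(A_t \mid S_t)\, b(S_t, \Phi_t)\big]
= \mathbb{E}\Big[\mathbb{E}\big[b(S_t, \Phi_t) \mid S_t\big]\;
  \mathbb{E}\big[\nabla_\theta \log \pi_\theta(A_t \mid S_t) \mid S_t\big]\Big]
= 0 .
\]

Under this assumption, subtracting $b(S_t, \Phi_t)$ from the return leaves the gradient estimate unbiased, while the dependence on $\Phi_t$ lets the baseline account for external factors that a baseline depending on $S_t$ alone cannot see. If $\Phi_t$ does carry information about $A_t$, the factorization fails and the estimate can become biased, which is why the hindsight information is constrained.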