Constrained Offline RL via Stationary Distribution Correction Estimation

We consider the offline constrained reinforcement learning problem, where the agent aims to compute a policy that maximizes reward while satisfying given cost constraints, using only a pre-collected dataset. This setting is appealing in many real-world scenarios where direct interaction with the environment is costly or risky and the resulting policy should comply with safety constraints. However, computing a policy that strictly satisfies the cost constraints is challenging in the offline RL setting, since off-policy evaluation inherently incurs estimation error. In this work, we present an offline constrained RL algorithm that directly estimates the stationary distribution corrections of the optimal policy while constraining an upper bound of the cost value. Our method estimates the cost upper bound efficiently and yields a more conservative policy in terms of cost constraint satisfaction.
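To make the idea concrete, here is a minimal sketch of how stationary distribution corrections yield off-policy value estimates, and how a pessimistic upper bound on the cost value can be formed. The correction weights `w(s,a) = d^pi(s,a) / d^D(s,a)` are assumed given (in practice they are the quantity the algorithm optimizes), and the Hoeffding-style confidence width is an illustrative stand-in, not the paper's exact bound construction.

```python
import numpy as np

def dice_value_estimate(weights, signals):
    """Off-policy value estimate via stationary distribution corrections:
    E_{(s,a)~d^D}[w(s,a) * signal(s,a)] approximates the policy's expected
    per-step signal under its own stationary distribution."""
    return float(np.mean(weights * signals))

def conservative_cost_bound(weights, costs, delta=0.05):
    """Illustrative upper confidence bound on the cost value: the point
    estimate plus a Hoeffding-style deviation term that shrinks with the
    dataset size n. Constraining this bound (rather than the point
    estimate) makes constraint satisfaction more conservative."""
    n = len(costs)
    point = dice_value_estimate(weights, costs)
    wc = weights * costs
    span = float(np.max(wc) - np.min(wc))  # range of the bounded estimator
    width = span * np.sqrt(np.log(1.0 / delta) / (2.0 * n))
    return point + width

# Synthetic dataset stand-in: correction ratios and per-step costs.
rng = np.random.default_rng(0)
w = rng.uniform(0.5, 1.5, size=1000)   # hypothetical d^pi / d^D ratios
c = rng.uniform(0.0, 1.0, size=1000)   # per-step costs from the dataset

point = dice_value_estimate(w, c)
upper = conservative_cost_bound(w, c)
```

Enforcing `upper <= threshold` instead of `point <= threshold` is what buys the extra safety margin against off-policy estimation error.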

Authors' notes