Marginalized operators for off-policy reinforcement learning

In this work, we propose marginalized operators, a new class of off-policy evaluation operators for reinforcement learning. Marginalized operators strictly generalize generic multi-step operators, which they recover as special cases. When recovering multi-step operators, marginalized operators yield sample-based estimates with potentially lower variance than sample-based estimates of the original multi-step operators. By extending estimation techniques from marginalized importance sampling, we show that sample-based estimates of marginalized operators can be computed in a scalable way. Finally, we empirically demonstrate that marginalized operators can benefit downstream policy optimization.
