Confident Least Square Value Iteration with Local Access to a Simulator

Learning with simulators is ubiquitous in modern reinforcement learning (RL). The simulator can either correspond to a simplified version of the real environment (such as a physics simulation of a robot arm) or to the environment itself (such as in games like Atari and Go). Among algorithms that are provably sample-efficient in this setting, most make the unrealistic assumption that all possible environment states are known before learning begins, or perform global optimistic planning which is computationally inefficient. In this work, we focus on simulationbased RL under a more realistic local access protocol, where the state space is unknown and the simulator can only be queried at states that have previously been observed (initial states and those returned by previous queries). We propose an algorithm named CONFIDENT-LSVI based on the template of least-square value iteration. CONFIDENT-LSVI incrementally builds a coreset of important states and uses the simulator to revisit them. Assuming that the linear function class has low approximation error under the Bellman optimality operator (a.k.a. low inherent Bellman error), we bound the algorithm performance in terms of this error, and show that it is queryand computationally-efficient.

Authors' notes