Contextual Information-Directed Sampling

Information-directed sampling (IDS) has recently shown promise as a data-efficient reinforcement learning algorithm. However, it remains unclear what form of information ratio should be optimized when a context or observation is available. In this paper, we study the contextual bandit problem with i.i.d. contexts. We refer to the version of IDS in \cite{lu2021reinforcement} as conditional IDS, since it optimizes the information ratio conditional on the current context. In contrast, we study a class of algorithms we call contextual IDS, which defines the information ratio by taking expectations over the context distribution. The main message is that an intelligent agent should invest more in actions that are beneficial for future, unseen contexts; conditional IDS can be myopic in this respect. We hope our findings shed light on algorithm design for the full reinforcement learning setting.
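The distinction above can be illustrated numerically. The sketch below is a toy example, not the paper's algorithm: all numbers are made up, it assumes contextual IDS minimizes the ratio of squared expected regret to expected information gain (expectations over the context distribution), and it restricts both variants to deterministic context-to-action maps rather than the stochastic policies used in practice.

```python
import itertools
import numpy as np

# Toy problem (all values are illustrative assumptions): 2 contexts, 3 actions.
# delta[c, a] = expected instantaneous regret of action a in context c
# info[c, a]  = information gain about the optimum from playing a in context c
p_ctx = np.array([0.5, 0.5])           # i.i.d. context distribution
delta = np.array([[0.1, 0.5, 0.3],
                  [0.4, 0.1, 0.3]])
info  = np.array([[0.01, 0.30, 0.20],
                  [0.30, 0.01, 0.25]])

# Conditional IDS: minimize the information ratio separately in each
# observed context, ignoring contexts that have not arrived yet.
cond_policy = [int(np.argmin(delta[c] ** 2 / info[c])) for c in range(2)]

# Contextual IDS (as assumed here): minimize the ratio of *expectations*
# over the context distribution, searching deterministic context-to-action maps.
def contextual_ratio(policy):
    d = sum(p_ctx[c] * delta[c, a] for c, a in enumerate(policy))
    g = sum(p_ctx[c] * info[c, a] for c, a in enumerate(policy))
    return d ** 2 / g

ctx_policy = min(itertools.product(range(3), repeat=2), key=contextual_ratio)
```

In this toy instance the two criteria disagree: conditional IDS picks the per-context compromise action in both contexts, while contextual IDS achieves a strictly smaller ratio by balancing regret and information across the context distribution.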

Authors' notes