No Regrets for Learning the Prior in Bandits

We propose to use Thompson sampling to adapt to a sequence of bandit tasks that the algorithm interacts with in a sequential manner. Our algorithm, which we call AdaTS, achieves lower regret than the alternatives that do not adapt by maintaining a distribution over the parameters of the prior, which drives additional exploration. AdaTS is a natural \emph{fully-Bayesian} algorithm, which can be implemented efficiently in a number of scenarios, as we demonstrate in this work. Our bounds on its Bayes regret show benefits of the extra layer of adaptivity. Our theoretical findings are also supported by experiments, where AdaTS outperforms prior algorithms and is applied to a challenging real-world problem.

Authors' notes