Bandit Phase Retrieval

We explore the relationship between information gain and regret in a bandit version of noisy phase retrieval where the learner chooses actions $A_t$ with $\Vert A_t\Vert_2 \leq 1$ and the expected reward is $\langle A_t, \theta_\star\rangle^2$ for unknown parameter vector $\theta_\star \in \mathbb R^d$. We prove that the minimax regret in this problem is $\smash{\tilde \Theta(d \sqrt{n})}$, which improves on the best known bounds by a factor of $\smash{\sqrt{d}}$. More importantly, our analysis shows that an apparently convincing heuristic for guessing lower bounds can be misleading. We also show that the simple regret for this problem is $\tilde \Theta(d / \sqrt{n})$ and that this is only achievable with an adaptive algorithm.
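To make the reward model concrete, here is a minimal simulation sketch of the environment described above: actions are unit-norm vectors, the mean reward of action $a$ is $\langle a, \theta_\star\rangle^2$, and the best action aligns with $\theta_\star$. The Gaussian noise, noise level, dimensions, and the uniformly random baseline policy are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

d, n = 5, 1000                         # dimension and horizon (illustrative)
theta_star = rng.standard_normal(d)
theta_star /= np.linalg.norm(theta_star)  # unknown parameter, here unit norm

def pull(a: np.ndarray, noise: float = 0.1) -> float:
    """Noisy reward with mean <a, theta_star>^2 (the reward model above)."""
    return float(a @ theta_star) ** 2 + noise * rng.standard_normal()

# The optimal action is a = theta_star / ||theta_star||, whose expected
# reward is ||theta_star||^2 (here 1, since theta_star was normalized).
best = float(np.linalg.norm(theta_star) ** 2)

# Naive non-adaptive baseline: play uniformly random unit-norm actions
# and accumulate (expected) regret against the optimal action.
regret = 0.0
for t in range(n):
    a = rng.standard_normal(d)
    a /= np.linalg.norm(a)
    regret += best - float(a @ theta_star) ** 2
print(regret)
```

For a random unit vector $a$, $\mathbb E\,\langle a, \theta_\star\rangle^2 = \Vert\theta_\star\Vert^2/d$, so this baseline incurs regret linear in $n$; the point of the paper is that a well-designed (and, for simple regret, necessarily adaptive) algorithm does far better.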

Authors' notes