Bandit Pure Exploration with Multiple Correct Answers

Thursday, April 4, 14:00-15:00 - Rémy Degenne - LPSM

Abstract: The active, or pure exploration, multi-armed bandit setting is the following: at successive times, an algorithm can sample one of K "arms" (probability distributions), basing its choice on the samples observed at previous rounds. The goal of the algorithm is to answer a query, for example "which arm has the highest mean?" (best arm identification) or "is there an arm with negative mean? If yes, return such an arm." A good algorithm answers correctly with high probability and uses as few samples as possible to do so.
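To make the setting concrete, here is a minimal sketch of fixed-confidence best arm identification using successive elimination with Hoeffding-style confidence intervals. This is an illustrative baseline only, not the algorithm from the talk; the arm distributions and the confidence radius are assumptions for the example.

```python
import math
import random

def best_arm_identification(arms, delta=0.05, seed=0):
    """Successive elimination: sample all active arms each round and
    drop any arm whose upper confidence bound falls below the best
    lower confidence bound. Returns (identified arm, total samples)."""
    random.seed(seed)
    k = len(arms)
    active = list(range(k))
    sums = [0.0] * k
    counts = [0] * k
    t = 0
    while len(active) > 1:
        t += 1
        for a in active:
            sums[a] += arms[a]()
            counts[a] += 1
        # Hoeffding radius with a crude union bound over arms and rounds
        radius = math.sqrt(math.log(4 * k * t * t / delta) / (2 * t))
        means = {a: sums[a] / counts[a] for a in active}
        best = max(means.values())
        active = [a for a in active if means[a] + 2 * radius >= best]
    return active[0], sum(counts)

# Example: three Bernoulli arms; arm 2 has the highest mean (0.8).
arms = [lambda p=p: float(random.random() < p) for p in (0.3, 0.5, 0.8)]
arm, samples = best_arm_identification(arms, delta=0.05)
```

With probability at least 1 - delta the returned arm is the one with the highest mean; the point of the sample-complexity literature discussed in the talk is to characterize how few samples any such correct algorithm can use.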
I will present recent work on the sample complexity of active exploration, both in the well-studied particular case of best arm identification and in general. In new work with Wouter M. Koolen, we determine the sample complexity of pure exploration bandit problems with multiple good answers. We derive a lower bound on the sample complexity using a game-equilibrium argument. We show how continuity and convexity properties of single-answer problems ensure that an algorithm has asymptotically optimal sample complexity. However, that convexity is lost when moving to the multiple-answer setting. We present a new algorithm which extends known techniques to the multiple-answer case and whose asymptotic sample complexity matches the lower bound.
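For context, game-equilibrium lower bounds of this kind are usually stated through a characteristic time. In the standard single-answer fixed-confidence formulation (a sketch of the usual statement, not necessarily the exact form from the talk), for a bandit model $\mu$ with alternative set $\mathrm{Alt}(\mu)$ (models where the correct answer differs), the expected stopping time $\tau_\delta$ of any $\delta$-correct algorithm satisfies $\liminf_{\delta \to 0} \mathbb{E}[\tau_\delta] / \log(1/\delta) \ge T^*(\mu)$, where

```latex
T^*(\mu)^{-1}
  \;=\;
  \sup_{w \in \Delta_K}\;
  \inf_{\lambda \in \mathrm{Alt}(\mu)}\;
  \sum_{a=1}^{K} w_a \,\mathrm{KL}(\mu_a, \lambda_a),
```

with $\Delta_K$ the simplex of sampling proportions over the $K$ arms and $\mathrm{KL}$ the Kullback-Leibler divergence between arm distributions. The sup-inf structure is the game mentioned in the abstract: the algorithm picks proportions $w$, an adversary picks the hardest alternative model $\lambda$.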
