Bandit Pure Exploration with Multiple Correct Answers

Thursday, April 4, 14:00-15:00 - Rémy Degenne - LPSM

Abstract: The active or pure exploration multi-armed bandit setting is the following: at successive times, an algorithm can sample one of K "arms" (probability distributions), basing its choice on the samples observed in previous rounds. The goal of the algorithm is to answer a query, for example "which arm has the highest mean?" (best arm identification) or "is there an arm with negative mean? If yes, return such an arm." A good algorithm answers correctly with high probability and uses as few samples as possible to do so.
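As a concrete illustration of the setting (not an algorithm from the talk), here is a minimal successive-elimination sketch for best arm identification. The function name, the sub-Gaussian confidence radius, and the stopping rule are my own illustrative choices; the point is only the interaction loop: sample arms adaptively, then commit to an answer.

```python
import math

def best_arm_identification(arms, delta=0.05, max_rounds=10_000):
    """Successive-elimination sketch for best arm identification.

    `arms` is a list of zero-argument callables, each returning one sample
    from the corresponding distribution.  Returns the index of the arm
    believed to have the highest mean, with confidence roughly 1 - delta
    (assuming 1-sub-Gaussian rewards).
    """
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    active = set(range(k))
    for _ in range(max_rounds):
        if len(active) == 1:
            break
        # Sample every arm still in contention.
        for a in active:
            sums[a] += arms[a]()
            counts[a] += 1
        # Confidence radius (crude union bound over arms and rounds).
        def radius(a):
            n = counts[a]
            return math.sqrt(2 * math.log(4 * k * n * n / delta) / n)
        means = {a: sums[a] / counts[a] for a in active}
        leader = max(active, key=lambda a: means[a])
        # Eliminate arms whose upper confidence bound falls below the
        # empirical leader's lower confidence bound.
        active = {a for a in active
                  if means[a] + radius(a) >= means[leader] - radius(leader)}
    return max(active, key=lambda a: sums[a] / counts[a])
```

The sample complexity of this elimination scheme is correct up to constants but not asymptotically optimal; the talk concerns the exact asymptotic characterization.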
I will present recent work on the sample complexity of active exploration, both in the well-studied special case of best arm identification and in general. In our new work with Wouter M. Koolen (see https://arxiv.org/abs/1902.03475), we determine the sample complexity of pure exploration bandit problems with multiple good answers. We derive a lower bound on the sample complexity using a game-equilibrium argument. We show how continuity and convexity properties of single-answer problems ensure that an algorithm has asymptotically optimal sample complexity. However, that convexity is lost when going to the multiple-answer setting. We present a new algorithm which extends known techniques to the multiple-answer case and has asymptotic sample complexity matching the lower bound.
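To make the lower-bound quantity concrete in the single-answer case: for best arm identification with unit-variance Gaussian arms, the characteristic time T*(mu) is given by a max-min over sampling allocations w, where the inner minimum ranges over confusing alternative instances. The sketch below (my own illustration, using the standard closed-form confusion cost for Gaussians, not code from the paper) approximates it by brute-force grid search over the simplex, which is only feasible for small K.

```python
import itertools

def confusion_cost(w, mu):
    """Min over suboptimal arms a of the cost of confusing the best arm b
    with a, under allocation w (closed form for unit-variance Gaussians):
    w_b * w_a * (mu_b - mu_a)^2 / (2 * (w_b + w_a))."""
    b = max(range(len(mu)), key=lambda i: mu[i])
    vals = []
    for a in range(len(mu)):
        if a == b:
            continue
        if w[b] + w[a] == 0:
            vals.append(0.0)
        else:
            vals.append(w[b] * w[a] * (mu[b] - mu[a]) ** 2
                        / (2 * (w[b] + w[a])))
    return min(vals)

def characteristic_time(mu, steps=100):
    """Approximate T*(mu) = 1 / max_w min_alt cost by brute-force search
    over allocations w on a simplex grid (small K only)."""
    best = 0.0
    k = len(mu)
    for c in itertools.product(range(steps + 1), repeat=k - 1):
        if sum(c) > steps:
            continue
        w = [x / steps for x in c] + [1 - sum(c) / steps]
        best = max(best, confusion_cost(w, mu))
    return 1.0 / best
```

For two unit-variance Gaussian arms with gap Delta, the optimal allocation is (1/2, 1/2) and T* = 8 / Delta^2, which the grid search recovers. With multiple correct answers, the inner minimum loses the convexity structure this max-min characterization relies on, which is the difficulty the talk addresses.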
