Shunxing Yan, Han Zhong
Optimism can stabilize Thompson sampling in multi-armed bandits, enabling valid asymptotic inference with minimal additional regret.
Thompson sampling is a widely used method for sequential decision-making problems known as multi-armed bandits, in which a learner repeatedly chooses among options with uncertain rewards. However, because such data is collected adaptively, traditional statistical methods can fail to deliver valid inference. This paper shows that incorporating optimism into Thompson sampling stabilizes the algorithm, enabling reliable asymptotic conclusions. The authors demonstrate that this approach remains valid even when there are multiple optimal arms, while only slightly increasing the regret, i.e., the cumulative cost of not always choosing the best arm.
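To make the idea concrete, here is a minimal sketch of one common way to add optimism to Thompson sampling on a Gaussian bandit: clip each arm's posterior sample at its posterior mean so that samples are never pessimistic. This is an illustrative assumption about the form of optimism, not necessarily the exact procedure analyzed in the paper; the arm means, horizon, and unit-variance noise are hypothetical.

```python
import numpy as np

def optimistic_thompson_sampling(means, horizon, seed=0):
    """Run optimistic Thompson sampling on a Gaussian bandit.

    Optimism here means clipping each posterior sample at the
    posterior mean (an illustrative choice, not necessarily the
    paper's exact variant). Returns per-arm pull counts and the
    realized regret against the best arm.
    """
    rng = np.random.default_rng(seed)
    k = len(means)
    counts = np.zeros(k)   # number of pulls per arm
    sums = np.zeros(k)     # cumulative reward per arm
    total_reward = 0.0
    for _ in range(horizon):
        # Gaussian posterior with an improper flat prior and unit noise
        post_mean = sums / np.maximum(counts, 1)
        post_std = 1.0 / np.sqrt(np.maximum(counts, 1))
        sample = rng.normal(post_mean, post_std)
        # Optimism: never sample below the posterior mean
        sample = np.maximum(sample, post_mean)
        # Pull each arm once before trusting the posterior
        untried = np.where(counts == 0)[0]
        arm = int(untried[0]) if len(untried) else int(np.argmax(sample))
        reward = rng.normal(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        total_reward += reward
    regret = horizon * max(means) - total_reward
    return counts, regret
```

The clipping step is the whole change relative to vanilla Thompson sampling; it prevents unlucky low samples from starving an arm, which is the kind of instability the paper's inference results rely on avoiding.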