Earning while learning: using Thompson Sampling to maximize rewards from online sales.
Author: Ellina, Andria
Supervisor: Currie, Christine
Ellina, Andria (2020) Earning while learning: using Thompson Sampling to maximize rewards from online sales. University of Southampton, Doctoral Thesis, 171pp.
Record type: Thesis (Doctoral)
Abstract
The problem of finding the best option amongst a range of candidates in an uncertain environment is challenging in a number of domains, ranging from clinical trials to advertising, website optimization and dynamic pricing.
Initially, very little is known about the performance of the different options, and the decision maker needs to simultaneously learn about the options' performance and earn some reward from the decisions made. This introduces a trade-off between "exploration", the phase in which new information is acquired, and "exploitation", in which the goal is to maximize reward or, equivalently, to minimize total regret. Regret is defined as the difference between the reward of an oracle strategy that selects the best option at each time step and the reward of the option actually chosen.
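The exploration/exploitation trade-off and the regret measure defined above can be sketched with a minimal Bernoulli Thompson Sampling loop (illustrative only: the arm probabilities, horizon and Beta(1, 1) priors here are assumptions, not the thesis's experimental settings):

```python
import random

def thompson_sampling(true_probs, horizon=10000, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors.

    Returns the cumulative expected regret relative to an oracle that
    always plays the best arm. `true_probs` holds the success
    probabilities of the candidate options, unknown to the algorithm.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # posterior successes + 1, per arm
    beta = [1] * k   # posterior failures + 1, per arm
    best = max(true_probs)
    regret = 0.0
    for _ in range(horizon):
        # Draw a plausible success probability for each arm from its
        # posterior and play the arm with the largest draw.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        # Expected regret of this choice versus the oracle arm.
        regret += best - true_probs[arm]
    return regret
```

Because the posterior draws concentrate on the best option as evidence accumulates, cumulative regret grows much more slowly than under a uniformly random policy.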
In this thesis, we develop new algorithms based on Thompson Sampling that improve overall performance and minimize total regret. Numerical experiments are performed on simulated datasets to examine the effect of the algorithms' hyperparameters, to assess the robustness of the algorithms presented, and to compare the performance of our new algorithms with existing ones. We use benchmarking experiments for a fair comparison of the different algorithms on the simulated datasets.
An additional complication, especially common in revenue management, is seasonal change, which affects the performance of the different options and consequently our decisions. To tackle this non-stationarity, we deploy contextual Thompson Sampling to account for seasonality and develop a new algorithm that combines contextual Thompson Sampling with a standard statistical model selection method to solve the problem of unknown seasonality in the reward distributions of the candidate options.
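As a toy illustration of using context to handle seasonality (a deliberate simplification of the contextual approach above: the season is reduced to a discrete label and each season/arm pair keeps an independent Beta posterior, whereas the thesis's method and its model selection step are more general):

```python
import random

class SeasonalThompson:
    """Contextual Thompson Sampling sketch with a discrete season as
    the context. Each (season, arm) pair has its own Beta posterior,
    so the policy can learn a different best arm per season."""

    def __init__(self, n_arms, n_seasons, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior for every season/arm combination.
        self.alpha = [[1] * n_arms for _ in range(n_seasons)]
        self.beta = [[1] * n_arms for _ in range(n_seasons)]

    def choose(self, season):
        # Sample from the posteriors of the current season only.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha[season], self.beta[season])]
        return samples.index(max(samples))

    def update(self, season, arm, reward):
        self.alpha[season][arm] += reward
        self.beta[season][arm] += 1 - reward
```

If the reward probabilities flip between seasons, the per-season posteriors let the policy switch its preferred arm accordingly, which a season-blind bandit cannot do.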
Finally, we focus on an application of dynamic pricing, developing an algorithm that learns how to price a product in a competitive environment where the demand function is unknown. Using simulation, we compare our algorithm in duopoly and oligopoly settings with a set of other algorithms from the literature.
Text: Thesis_all_chapters_final (1) - Version of Record
Text: e-thesis form - Version of Record (Restricted to Repository staff only)
More information
Submitted date: April 2019
Published date: 2020
Identifiers
Local EPrints ID: 452895
URI: http://eprints.soton.ac.uk/id/eprint/452895
PURE UUID: 32259233-4567-4af5-a6ea-4bb421bacbd1
Catalogue record
Date deposited: 06 Jan 2022 17:47
Last modified: 17 Mar 2024 02:56
Contributors
Author:
Andria Ellina