Earning while learning: using Thompson Sampling to maximize rewards from online sales.
Author: Ellina, Andria
Supervisor: Currie, Christine
Ellina, Andria (2020) Earning while learning: using Thompson Sampling to maximize rewards from online sales. University of Southampton, Doctoral Thesis, 171pp.
Record type: Thesis (Doctoral)
Abstract
The problem of finding the best option amongst a range of candidates in an uncertain environment is challenging in a number of domains, ranging from clinical trials to advertising, website optimization and dynamic pricing.
Initially, very little is known about the performance of the different options, and the decision maker needs to simultaneously learn about the options' performance and earn some reward from the decisions made. This introduces a trade-off between "exploration", the phase in which new information is acquired, and "exploitation", in which the goal is to maximize reward or, equivalently, to minimize total regret. Regret is defined as the difference between the reward of an oracle strategy that selects the best option at each time step and the reward of the option actually chosen.
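The exploration/exploitation trade-off and the regret measure defined above can be sketched with a minimal Bernoulli Thompson Sampling loop (illustrative only: the arm probabilities, horizon and Beta(1, 1) priors here are assumptions, not the thesis's experimental settings):

```python
import random

def thompson_sampling(true_probs, horizon=10000, seed=0):
    """Bernoulli Thompson Sampling with Beta(1, 1) priors.

    Returns the cumulative expected regret relative to an oracle that
    always plays the best arm. `true_probs` holds the success
    probabilities of the candidate options, unknown to the algorithm.
    """
    rng = random.Random(seed)
    k = len(true_probs)
    alpha = [1] * k  # posterior successes + 1, per arm
    beta = [1] * k   # posterior failures + 1, per arm
    best = max(true_probs)
    regret = 0.0
    for _ in range(horizon):
        # Draw a plausible success probability for each arm from its
        # posterior and play the arm with the largest draw.
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(k)]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_probs[arm] else 0
        alpha[arm] += reward
        beta[arm] += 1 - reward
        # Expected regret of this choice versus the oracle arm.
        regret += best - true_probs[arm]
    return regret
```

Because the posterior draws concentrate on the best option as evidence accumulates, cumulative regret grows much more slowly than under a uniformly random policy.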
In this thesis, we develop new algorithms based on Thompson Sampling that improve overall performance and minimize total regret. Numerical experiments are performed on simulated datasets to examine the effect of the algorithms' hyperparameters, to assess the robustness of the algorithms presented, and to compare the performance of our new algorithms with existing ones. We use benchmarking experiments for a fair comparison of the different algorithms on the simulated datasets.
An additional complication, especially common in revenue management, is seasonal change, which affects the performance of the different options and consequently our decisions. To tackle this non-stationarity, we deploy contextual Thompson Sampling to account for seasonality and develop a new algorithm that combines contextual Thompson Sampling with a standard statistical model selection method to solve the problem of unknown seasonality in the reward distributions of the candidate options.
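As a toy illustration of using context to handle seasonality (a deliberate simplification of the contextual approach above: the season is reduced to a discrete label and each season/arm pair keeps an independent Beta posterior, whereas the thesis's method and its model selection step are more general):

```python
import random

class SeasonalThompson:
    """Contextual Thompson Sampling sketch with a discrete season as
    the context. Each (season, arm) pair has its own Beta posterior,
    so the policy can learn a different best arm per season."""

    def __init__(self, n_arms, n_seasons, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior for every season/arm combination.
        self.alpha = [[1] * n_arms for _ in range(n_seasons)]
        self.beta = [[1] * n_arms for _ in range(n_seasons)]

    def choose(self, season):
        # Sample from the posteriors of the current season only.
        samples = [self.rng.betavariate(a, b)
                   for a, b in zip(self.alpha[season], self.beta[season])]
        return samples.index(max(samples))

    def update(self, season, arm, reward):
        self.alpha[season][arm] += reward
        self.beta[season][arm] += 1 - reward
```

If the reward probabilities flip between seasons, the per-season posteriors let the policy switch its preferred arm accordingly, which a season-blind bandit cannot do.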
Finally, we focus on an application of dynamic pricing, developing an algorithm that learns how to price a product in a competitive environment where the demand function is unknown. Using simulation, we compare our algorithm in duopoly and oligopoly settings with a set of other algorithms from the literature.
Text: Thesis_all_chapters_final (1) - Version of Record
Text: e-thesis form - Version of Record (Restricted to Repository staff only)
More information
Submitted date: April 2019
Published date: 2020
Identifiers
Local EPrints ID: 452895
URI: http://eprints.soton.ac.uk/id/eprint/452895
PURE UUID: 32259233-4567-4af5-a6ea-4bb421bacbd1
Catalogue record
Date deposited: 06 Jan 2022 17:47
Last modified: 17 Mar 2024 02:56
Contributors
Author:
Andria Ellina