Tran-Thanh, Long (2012) Budget-limited multi-armed bandits. University of Southampton, Faculty of Physical and Applied Sciences, Doctoral Thesis, 173pp.
Abstract
Decision making under uncertainty is one of the most important challenges in artificial intelligence, as it arises in many everyday situations that agents have to face. In these situations, an agent has to choose from a set of options whose payoffs are uncertain (i.e. unknown and non-deterministic) to the agent. Common to such decision making problems is the need to balance exploration and exploitation: in order to maximise its total payoff, the agent must decide whether to choose the option expected to provide the best payoff (exploitation) or to try an alternative option for potential future benefit (exploration). Among the many abstractions of decision making under uncertainty, multi-armed bandits are perhaps the most common and best studied, as they provide one of the clearest examples of the trade-off between exploration and exploitation. Whilst the standard bandit model has broad applicability, it does not completely describe a number of real-world decision making problems. Specifically, in many cases, pulling an arm (i.e. making a decision) is further constrained by costs or other limitations. In this thesis, we introduce the budget-limited bandit model, a variant of the standard bandit, in which pulling an arm is costly and the total cost is limited by a fixed budget. This model is motivated by a number of real-world applications, such as wireless sensor networks and online advertising.
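To make this setting concrete, the following minimal Python sketch (not taken from the thesis; the class name, arm costs and Bernoulli rewards are illustrative assumptions) simulates a budget-limited bandit in which each pull of an arm yields a stochastic reward, consumes a known cost, and is only allowed while the remaining budget covers that cost.

import random

class BudgetLimitedBandit:
    """Toy budget-limited bandit: pulling arm i costs costs[i] and
    returns a Bernoulli reward with mean means[i] (unknown to the agent)."""

    def __init__(self, means, costs, budget):
        self.means = means
        self.costs = costs
        self.budget = budget

    def can_pull(self, arm):
        # A pull is only feasible if its cost fits in the remaining budget.
        return self.costs[arm] <= self.budget

    def pull(self, arm):
        assert self.can_pull(arm), "not enough budget left for this arm"
        self.budget -= self.costs[arm]
        return 1.0 if random.random() < self.means[arm] else 0.0

# Example: three arms with different payoffs and costs, and a budget of 100 cost units.
bandit = BudgetLimitedBandit(means=[0.2, 0.5, 0.7], costs=[1, 3, 4], budget=100)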
We demonstrate that our bandit model cannot be reduced to existing bandit models, as it requires a different optimal behaviour. Given this, the main objective of this thesis is to provide novel pulling algorithms that efficiently tackle the budget-limited bandit problem. Such algorithms, however, have to meet a number of requirements from both the empirical and the theoretical perspectives: the former refers to the constraints imposed by the motivating real-world applications, whilst the latter concerns theoretical performance guarantees. To begin with, we propose a simple pulling algorithm, budget-limited ε-first, that addresses the empirical requirements: it is empirically efficient and has low computational cost, but it does not fulfil the theoretical requirements. To provide theoretical guarantees, we introduce two budget-limited UCB-based algorithms, KUBE and fractional KUBE. In particular, we prove that these algorithms achieve asymptotically optimal regret bounds, which differ from the best possible bound only by a constant factor. However, we demonstrate in extensive simulations that these algorithms are typically outperformed by budget-limited ε-first. As a result, to trade off efficiently between the theoretical and empirical requirements, we develop two decreasing ε-greedy based approaches, KDE and fractional KDE, that achieve good performance from both the theoretical and the empirical perspective. Specifically, we show that, like the budget-limited UCB-based algorithms, both KDE and fractional KDE achieve asymptotically optimal regret bounds. In addition, we demonstrate that these algorithms perform well empirically compared to budget-limited ε-first.
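As an illustration of the simplest of these policies, the sketch below shows one plausible reading of the budget-limited ε-first idea, building on the BudgetLimitedBandit class sketched above: an ε fraction of the budget is spent pulling arms uniformly, and the remainder is spent exploiting the resulting estimates. The exploitation rule shown (greedily pulling the affordable arm with the highest estimated reward-to-cost ratio) is an assumption made for illustration, not a statement of the exact policy analysed in the thesis.

def epsilon_first(bandit, epsilon=0.1):
    """Spend an epsilon fraction of the budget on uniform exploration,
    then exploit the resulting estimates for the rest of the budget."""
    n_arms = len(bandit.costs)
    totals = [0.0] * n_arms   # summed rewards per arm
    counts = [0] * n_arms     # number of pulls per arm
    exploit_budget = (1.0 - epsilon) * bandit.budget
    total_reward = 0.0

    # Exploration phase: sweep the arms in round-robin order until
    # roughly an epsilon fraction of the budget has been spent.
    arm = 0
    while bandit.budget > exploit_budget and bandit.can_pull(arm):
        reward = bandit.pull(arm)
        totals[arm] += reward
        counts[arm] += 1
        total_reward += reward
        arm = (arm + 1) % n_arms

    # Exploitation phase: repeatedly pull the affordable arm with the
    # best estimated reward-to-cost ratio, keeping the estimates fixed.
    while True:
        affordable = [i for i in range(n_arms) if counts[i] > 0 and bandit.can_pull(i)]
        if not affordable:
            break
        best = max(affordable, key=lambda i: (totals[i] / counts[i]) / bandit.costs[i])
        total_reward += bandit.pull(best)
    return total_reward

In such a policy, ε directly trades the budget spent on exploration against the quality of the estimates available for exploitation.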
To provide a grounding for the algorithms we develop, the second part of this thesis contains a running example of a wireless sensor network (WSN) scenario, in which we tackle the problem of long-term information collection, a key research challenge within the domain of WSNs. In particular, we demonstrate that, by using the budget-limited bandit algorithms, we advance the state of the art within this domain. To do so, we first decompose the problem of long-term information collection into two sub-problems: energy management and maximal information throughput routing. We then tackle the former with a budget-limited multi-armed bandit based approach, and we propose an optimal decentralised algorithm for the latter. Following this, we demonstrate that the budget-limited bandit based energy management, in conjunction with the optimal routing algorithm, outperforms state-of-the-art information collection algorithms in the domain of WSNs.