Adversarial blocking bandits

Online Learning, Bandit Algorithms, Sequential Decision Making

Bishop, Nicholas, Chan, Hau, Mandal, Debmalya and Tran-Thanh, Long (2020) Adversarial blocking bandits. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F. and Lin, H. (eds.) *Advances in Neural Information Processing Systems 33 (NeurIPS 2020)*. NeurIPS.

Record type: Conference or Workshop Item (Paper)

## Abstract

We consider a general adversarial multi-armed blocking bandit setting where each played arm can be blocked (unavailable) for some time periods, and the reward per arm is given at each time period adversarially, without obeying any distribution. The setting models scenarios of allocating scarce, reusable supplies (e.g., arms), where the supplies replenish and can be reused only after certain time periods. We first show that, in the optimisation setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., determining which arm to play in each round) that maximises the cumulative reward is strongly NP-hard, eliminating the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement this result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variation of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded blocking duration and bounded path variation. In particular, when the variation budget B_T is known in advance, RGA achieves O(\sqrt{T(2\tilde{D}+K)B_{T}}) dynamic approximate regret. On the other hand, when B_T is not known, we show that the dynamic approximate regret of RGA-META is at most O((K+\tilde{D})^{1/4}\tilde{B}^{1/2}T^{3/4}), where \tilde{B} is the maximal path variation budget within each batch of RGA-META (which is provably of order o(\sqrt{T})). We also prove that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret is at least \Theta(T). Finally, we show that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of O(1).
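The greedy full-information baseline described in the abstract — play the highest-reward arm among those not currently blocked, after which the played arm becomes unavailable for its blocking duration — can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the function name `greedy_blocking` and the data layout (`rewards[t][k]`, `durations[k]`) are assumptions for the example.

```python
# Hypothetical sketch of the greedy policy from the abstract: at each round,
# play the available arm with the highest current reward, then block that arm
# for its known blocking duration. Names and data layout are illustrative.

def greedy_blocking(rewards, durations):
    """rewards[t][k]: adversarial reward of arm k at round t.
    durations[k]: number of rounds arm k is blocked after being played.
    Returns the cumulative reward collected by the greedy policy."""
    T, K = len(rewards), len(durations)
    available_at = [0] * K  # earliest round at which each arm may be played again
    total = 0.0
    for t in range(T):
        avail = [k for k in range(K) if available_at[k] <= t]  # unblocked arms
        if not avail:
            continue  # every arm is blocked this round; no reward is collected
        k = max(avail, key=lambda a: rewards[t][a])  # best available arm
        total += rewards[t][k]
        available_at[k] = t + durations[k]  # arm k is now blocked
    return total
```

For instance, with two arms where arm 0 always pays 1.0 but blocks for 2 rounds and arm 1 always pays 0.5 with a 1-round block, greedy alternates between them as availability allows.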

Text: **Adversarial Blocking Bandits - Author's Original**

## More information

Accepted/In Press date: 25 September 2020

Published date: 2020

Keywords:
Online Learning, Bandit Algorithms, Sequential Decision Making

## Identifiers

Local EPrints ID: 445488

URI: http://eprints.soton.ac.uk/id/eprint/445488

PURE UUID: f1520424-d368-4c6b-86d6-aabdcc26c312

## Catalogue record

Date deposited: 11 Dec 2020 17:30

Last modified: 28 Apr 2022 02:02

## Contributors

Author:
Nicholas Bishop

Author:
Hau Chan

Author:
Debmalya Mandal

Author:
Long Tran-Thanh
Editor:
H. Larochelle

Editor:
M. Ranzato

Editor:
R. Hadsell

Editor:
M.F. Balcan

Editor:
H. Lin
