Adversarial blocking bandits
Bishop, Nicholas, Chan, Hau, Mandal, Debmalya and Tran-Thanh, Long (2020) Adversarial blocking bandits. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F. and Lin, H. (eds.) Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Neural Information Processing Systems Foundation.
Record type: Conference or Workshop Item (Paper)
Abstract
We consider a general adversarial multi-armed blocking bandit setting where each played arm can be blocked (unavailable) for some time periods and the reward per arm is given at each time period adversarially without obeying any distribution. The setting models scenarios of allocating scarce limited supplies (e.g., arms) where the supplies replenish and can be reused only after certain time periods. We first show that, in the optimization setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., determining which arm to play in each round) that maximises the cumulative reward is strongly NP-hard, eliminating the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement this result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variance of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded duration and path variation. In particular, when the variation budget B_T is known in advance, RGA achieves O(\sqrt{T(2\tilde{D}+K)B_{T}}) dynamic approximate regret. On the other hand, when B_T is not known, we show that the dynamic approximate regret of RGA-META is at most O((K+\tilde{D})^{1/4}\tilde{B}^{1/2}T^{3/4}), where \tilde{B} is the maximal path variation budget within each batch of RGA-META (which is provably of order o(\sqrt{T})). We also prove that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret is at least \Theta(T). Finally, we show that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of O(1).
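As a concrete illustration of the greedy baseline analysed in the optimization setting, the following is a minimal Python sketch (not the authors' implementation) of a policy that plays the best available arm in each round. The array layout and the convention that playing an arm whose duration is d makes it unavailable for the next d-1 rounds are assumptions made for this example.

import numpy as np

def greedy_blocking_policy(rewards, durations):
    """Greedy full-information policy for blocking bandits (illustrative sketch).

    rewards:   T x K array, rewards[t, k] = adversarial reward of arm k at round t
    durations: T x K array, durations[t, k] = blocking duration if arm k is played at round t
    Returns the sequence of pulled arms (None when every arm is blocked) and the
    cumulative reward collected.
    """
    T, K = rewards.shape
    free_at = np.zeros(K, dtype=int)   # first round at which each arm is available again
    pulls, total = [], 0.0
    for t in range(T):
        available = np.flatnonzero(free_at <= t)
        if available.size == 0:        # every arm is currently blocked
            pulls.append(None)
            continue
        k = available[np.argmax(rewards[t, available])]  # best available arm this round
        total += rewards[t, k]
        free_at[k] = t + durations[t, k]  # duration 1 means the arm is free again next round
        pulls.append(k)
    return pulls, total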
Text: Adversarial Blocking Bandits - Author's Original
More information
Accepted/In Press date: 25 September 2020
Published date: 2020
Keywords:
Online Learning, Bandit Algorithms, Sequential Decision Making
Identifiers
Local EPrints ID: 445488
URI: http://eprints.soton.ac.uk/id/eprint/445488
PURE UUID: f1520424-d368-4c6b-86d6-aabdcc26c312
Catalogue record
Date deposited: 11 Dec 2020 17:30
Last modified: 09 Apr 2024 22:02
Contributors
Author: Nicholas Bishop
Author: Hau Chan
Author: Debmalya Mandal
Author: Long Tran-Thanh
Editor: H. Larochelle
Editor: M. Ranzato
Editor: R. Hadsell
Editor: M.F. Balcan
Editor: H. Lin