University of Southampton Institutional Repository

Adversarial blocking bandits

Bishop, Nicholas, Chan, Hau, Mandal, Debmalya and Tran-Thanh, Long (2020) Adversarial blocking bandits. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F. and Lin, H. (eds.) Advances in Neural Information Processing Systems 33 (NeurIPS 2020). Neural Information Processing Systems Foundation.

Record type: Conference or Workshop Item (Paper)

Abstract

We consider a general adversarial multi-armed blocking bandit setting where each played arm can be blocked (unavailable) for some time periods and the reward of each arm is chosen adversarially at each time period, without obeying any distribution. The setting models scenarios of allocating scarce, limited supplies (e.g., arms) that replenish and can be reused only after certain time periods. We first show that, in the optimization setting, when the blocking durations and rewards are known in advance, finding an optimal policy (i.e., determining which arm to play at each round) that maximises the cumulative reward is strongly NP-hard, eliminating the possibility of a fully polynomial-time approximation scheme (FPTAS) for the problem unless P = NP. To complement this result, we show that a greedy algorithm that plays the best available arm at each round provides an approximation guarantee that depends on the blocking durations and the path variation of the rewards. In the bandit setting, when the blocking durations and rewards are not known, we design two algorithms, RGA and RGA-META, for the case of bounded blocking duration and path variation. In particular, when the variation budget B_T is known in advance, RGA achieves O(\sqrt{T(2\tilde{D}+K)B_{T}}) dynamic approximate regret. On the other hand, when B_T is not known, we show that the dynamic approximate regret of RGA-META is at most O((K+\tilde{D})^{1/4}\tilde{B}^{1/2}T^{3/4}), where \tilde{B} is the maximal path variation budget within each batch of RGA-META (which is provably of order o(\sqrt{T})). We also prove that if either the variation budget or the maximal blocking duration is unbounded, the approximate regret is at least \Theta(T). We also show that the regret upper bound of RGA is tight if the blocking durations are bounded above by an order of O(1).
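
As an illustration only (not taken from the paper), the greedy baseline mentioned above can be sketched in a few lines: at each round it plays the arm with the highest current reward among those not blocked, then marks that arm as unavailable for its blocking duration. The sketch below assumes the full reward and blocking-duration tables are known in advance (the optimization setting); all identifiers (greedy_blocking, rewards, delays) are hypothetical and chosen for readability.

# Minimal sketch of a greedy policy for the offline blocking bandit problem:
# at every round, play the available arm with the highest reward, then block it.
# Assumes rewards[t][k] and delays[t][k] are known in advance; names are illustrative.
def greedy_blocking(rewards, delays):
    T = len(rewards)          # number of rounds
    K = len(rewards[0])       # number of arms
    next_free = [0] * K       # round at which each arm becomes available again
    total, plays = 0.0, []
    for t in range(T):
        available = [k for k in range(K) if next_free[k] <= t]
        if not available:     # every arm is blocked this round
            plays.append(None)
            continue
        k = max(available, key=lambda a: rewards[t][a])
        total += rewards[t][k]
        next_free[k] = t + delays[t][k]   # arm k becomes available again at round t + delays[t][k]
        plays.append(k)
    return total, plays

# Toy usage: 2 arms, 4 rounds.
rewards = [[1.0, 0.5], [0.9, 0.6], [0.2, 0.8], [0.7, 0.1]]
delays  = [[2, 1], [2, 1], [1, 1], [1, 1]]
print(greedy_blocking(rewards, delays))

The paper's approximation guarantee for this style of policy depends on the blocking durations and the path variation of the rewards; the sketch makes no claim about matching those constants.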

Text
Adversarial Blocking Bandits - Author's Original

More information

Accepted/In Press date: 25 September 2020
Published date: 2020
Keywords: Online Learning, Bandit Algorithms, Sequential Decision Making

Identifiers

Local EPrints ID: 445488
URI: http://eprints.soton.ac.uk/id/eprint/445488
PURE UUID: f1520424-d368-4c6b-86d6-aabdcc26c312
ORCID for Nicholas Bishop: orcid.org/0000-0001-7062-9072
ORCID for Long Tran-Thanh: orcid.org/0000-0003-1617-8316

Catalogue record

Date deposited: 11 Dec 2020 17:30
Last modified: 09 Apr 2024 22:02


Contributors

Author: Nicholas Bishop
Author: Hau Chan
Author: Debmalya Mandal
Author: Long Tran-Thanh
Editor: H. Larochelle
Editor: M. Ranzato
Editor: R. Hadsell
Editor: M.F. Balcan
Editor: H. Lin


