The Exploration-Exploitation Tradeoff in Sequential Decision Making Problems
Sequential decision making problems often require an agent to act in an environment where data is noisy or not fully observed. The agent must learn how different actions relate to different rewards, and must therefore balance exploration and exploitation in an effective strategy. In this report, sequential decision making problems are considered through extensions of the multi-armed bandit framework. Firstly, the bandit problem is extended to a Multi-Agent System (MAS), where agents control individual arms but can communicate potentially useful information with each other. This framework allows for a better understanding of the exploration-exploitation tradeoff in scenarios where multiple agents interact in a noisy environment. To this end, we present a novel strategy for action and communication decisions and demonstrate its benefits empirically. This motivates a theoretical analysis of one-armed bandit problems, to understand how different strategies should be optimally tuned. Specifically, the expected rewards of ε-greedy strategies are derived, together with proofs governing their optimal tuning.
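The following Python sketch illustrates the exploration-exploitation tradeoff with a standard ε-greedy policy on a Bernoulli multi-armed bandit. It is not the report's MAS strategy, and it is not its analytical derivation. The function name `epsilon_greedy`, the Bernoulli reward model and the uniform-random exploration step are assumptions made for illustration. With probability ε the agent explores an arm chosen uniformly at random. Otherwise it exploits the arm with the highest sample-mean reward observed so far.

```python
import random
from typing import List, Tuple


def epsilon_greedy(arm_means: List[float], epsilon: float, horizon: int,
                   seed: int = 0) -> Tuple[float, List[int]]:
    """Hypothetical helper (not from the report): run epsilon-greedy on a
    Bernoulli bandit and return the total reward and per-arm pull counts."""
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must lie in [0, 1]")
    rng = random.Random(seed)   # seeded so runs are reproducible
    k = len(arm_means)
    counts = [0] * k            # number of pulls of each arm
    estimates = [0.0] * k       # running sample mean of each arm's reward
    total = 0.0
    for _ in range(horizon):
        if rng.random() < epsilon:
            # Explore: pick an arm uniformly at random.
            arm = rng.randrange(k)
        else:
            # Exploit: pick the arm with the best estimate (ties go to the lowest index).
            arm = max(range(k), key=lambda a: estimates[a])
        # Assumed Bernoulli reward: 1 with probability arm_means[arm], else 0.
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the sample mean for the pulled arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total, counts
```

For example, `epsilon_greedy([0.0, 1.0], 0.0, 1000)` never leaves the first arm. A purely greedy policy (ε = 0) with ties broken towards the first arm has no reason to try the second. Any ε > 0 eventually discovers the better arm, at the cost of continued random exploration. Varying ε for a fixed horizon therefore gives an empirical view of the tuning question that the report studies analytically for one-armed bandits.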
Sykulski, Adam M.
Sykulski, Adam M. (2009) The Exploration-Exploitation Tradeoff in Sequential Decision Making Problems. (In Press)
Record type: Monograph (Project Report)
More information
Accepted/In Press date: May 2009
Organisations: Electronics & Computer Science
Identifiers
Local EPrints ID: 271349
URI: http://eprints.soton.ac.uk/id/eprint/271349
PURE UUID: 70edf59d-1af0-4e43-910d-ce4dca841ee0
Catalogue record
Date deposited: 05 Jul 2010 23:53
Last modified: 14 Mar 2024 09:29
Contributors
Author: Adam M. Sykulski