University of Southampton Institutional Repository

On the efficiency of data collection and aggregation for the combination of multiple classifiers

University of Southampton
Manino, Edoardo
e5cec65c-c44b-45de-8255-7b1d8edfc04d
Tran-Thanh, Long
e0666669-d34b-460e-950d-e8b139fab16c

Manino, Edoardo (2020) On the efficiency of data collection and aggregation for the combination of multiple classifiers. Doctoral Thesis, 187pp.

Record type: Thesis (Doctoral)

Abstract

Many classification problems are solved by combining the output of a group of distinct predictors. Whether it is voting, consulting domain experts, training an ensemble method or crowdsourcing, the collective consensus we reach is typically more robust and accurate than the decisions of an individual predictor alone. However, aggregating the predictors' output efficiently is not a trivial endeavour. Furthermore, when we need to solve not just one but multiple classification problems at the same time, the question of how to allocate the limited pool of available predictors arises. These two questions of collecting and aggregating the data from multiple predictors have been addressed to various extents in the existing literature. On the one hand, aggregation algorithms are numerous but are mostly designed for predictive accuracy alone; achieving state-of-the-art accuracy in a computationally efficient way remains an open question. On the other hand, empirical studies show that the collection policies we use to allocate the available pool of predictors have a strong impact on the performance of the system, yet to date there is little theoretical understanding of this phenomenon.

In this thesis, we tackle these research questions from both a theoretical and an algorithmic angle. First, we develop the theoretical tools to uncover the link between the predictive accuracy of the system and its causal factors: the quality of the predictors, their number and the algorithms we use. We do so by representing the data collection process as a random walk in the posterior probability space, and deriving upper and lower bounds on the expected accuracy. These bounds reveal that the trade-off between the number of predictors and accuracy is always exponential, and allow us to quantify its coefficient. With these tools, we provide the first theoretical explanation of the accuracy gap between different data collection policies. Namely, we prove that the probability of error of adaptive policies decays at more than double the exponential rate of non-adaptive ones. Likewise, we prove that the two most popular adaptive policies, uncertainty sampling and information gain maximisation, are mathematically equivalent. Furthermore, our analysis holds both in the case where we know the accuracy of each individual predictor exactly, and in the case where we only have access to a noisy estimate of it.

Finally, we revisit the problem of aggregating the predictors' output by proposing two novel algorithms. The first, Mirror Gibbs, is a refinement of traditional Monte Carlo sampling and achieves better than state-of-the-art accuracy with fewer samples. The second, Streaming Bayesian Inference for Crowdsourcing (SBIC), is based on variational inference and comes in two variants: Fast SBIC is designed for computational speed, while Sorted SBIC is designed for predictive accuracy. Both deliver state-of-the-art accuracy and feature provable asymptotic guarantees.
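
To make the setting concrete, the following is a minimal illustrative sketch, not taken from the thesis itself: it aggregates the votes of noisy binary predictors with a simple Bayesian log-odds update, and compares a non-adaptive round-robin collection policy with an adaptive uncertainty-sampling one. All names and numbers (query, run, the 0.7 predictor accuracy, the budget of 400 queries) are assumptions chosen only for this example.

# Illustrative sketch (not from the thesis): Bayesian aggregation of noisy
# binary predictors, and two collection policies for allocating a fixed
# budget of predictor queries across several tasks. All parameter values
# below are assumptions chosen for the example.
import numpy as np

rng = np.random.default_rng(0)

n_tasks = 50          # number of binary classification problems
budget = 400          # total number of predictor queries available
accuracy = 0.7        # assumed (known) accuracy of every predictor

true_labels = rng.integers(0, 2, size=n_tasks) * 2 - 1   # labels in {-1, +1}
llr = np.log(accuracy / (1.0 - accuracy))  # log-likelihood ratio of one vote

def query(task):
    """Ask one predictor for its label on a task (correct w.p. `accuracy`)."""
    correct = rng.random() < accuracy
    return true_labels[task] if correct else -true_labels[task]

def run(policy):
    """Spend the budget one query at a time and return the final accuracy."""
    log_odds = np.zeros(n_tasks)      # posterior log-odds of label +1 per task
    for t in range(budget):
        task = policy(log_odds, t)
        vote = query(task)
        log_odds[task] += vote * llr  # Bayesian update for a single vote
    predictions = np.where(log_odds >= 0, 1, -1)
    return np.mean(predictions == true_labels)

def round_robin(log_odds, t):
    """Non-adaptive policy: spread queries uniformly across tasks."""
    return t % n_tasks

def uncertainty_sampling(log_odds, t):
    """Adaptive policy: query the task whose posterior is most uncertain."""
    return int(np.argmin(np.abs(log_odds)))

print("round robin         :", run(round_robin))
print("uncertainty sampling:", run(uncertainty_sampling))

Under these assumptions, the adaptive policy typically concentrates queries on the tasks whose posterior is still close to even odds and reaches higher accuracy for the same budget; the gap between the two policies is the phenomenon the thesis quantifies theoretically.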

Text: PhD_thesis_with_copyright (Download, 1MB)
Text: PDThesis form Manino - SIGNED (Restricted to Repository staff only)

More information

Published date: August 2020

Identifiers

Local EPrints ID: 447370
URI: http://eprints.soton.ac.uk/id/eprint/447370
PURE UUID: 8d74d5bb-ca64-4328-aa3d-7429e668e330
ORCID for Edoardo Manino: orcid.org/0000-0003-0028-5440
ORCID for Long Tran-Thanh: orcid.org/0000-0003-1617-8316

Catalogue record

Date deposited: 10 Mar 2021 17:37
Last modified: 16 Mar 2024 11:22


Contributors

Author: Edoardo Manino
Thesis advisor: Long Tran-Thanh


