On the efficiency of data collection and aggregation for the combination of multiple classifiers
University of Southampton
Manino, Edoardo (2020) On the efficiency of data collection and aggregation for the combination of multiple classifiers. Doctoral Thesis, 187pp.
Record type: Thesis (Doctoral)
Abstract
Many classification problems are solved by combining the output of a group of distinct predictors. Whether it is voting, consulting domain experts, training an ensemble method or crowdsourcing, the collective consensus we reach is typically more robust and accurate than the decisions of an individual predictor alone. However, aggregating the predictors’ output efficiently is not a trivial endeavour. Furthermore, when we need to solve not just one but multiple classification problems at the same time, the question of how to allocate the limited pool of available predictors arises.

These two questions of collecting and aggregating the data from multiple predictors have been addressed to various extents in the existing literature. On the one hand, aggregation algorithms are numerous but are mostly designed for predictive accuracy alone; achieving state-of-the-art accuracy in a computationally efficient way is currently an open question. On the other hand, empirical studies show that the collection policies we use to allocate the available pool of predictors have a strong impact on the performance of the system. However, to date there is little theoretical understanding of this phenomenon.

In this thesis, we tackle these research questions from both a theoretical and an algorithmic angle. First, we develop the theoretical tools to uncover the link between the predictive accuracy of the system and its causal factors: the quality of the predictors, their number and the algorithms we use. We do so by representing the data collection process as a random walk in the posterior probability space, and deriving upper and lower bounds on the expected accuracy. These bounds reveal that the tradeoff between the number of predictors and accuracy is always exponential, and allow us to quantify its coefficient. With these tools, we provide the first theoretical explanation of the accuracy gap between different data collection policies. Namely, we prove that the probability of error of adaptive policies decays at more than double the exponential rate of non-adaptive ones. Likewise, we prove that the two most popular adaptive policies, uncertainty sampling and information gain maximisation, are mathematically equivalent. Furthermore, our analysis holds both in the case where we know the accuracy of each individual predictor exactly, and in the case where we only have access to some noisy estimate of it.

Finally, we revisit the problem of aggregating the predictors’ output by proposing two novel algorithms. The first, Mirror Gibbs, is a refinement of traditional Monte Carlo sampling and achieves better than state-of-the-art accuracy with fewer samples. The second, Streaming Bayesian Inference for Crowdsourcing (SBIC), is based on variational inference and comes in two variants: Fast SBIC is designed for computational speed, while Sorted SBIC is designed for predictive accuracy. Both deliver state-of-the-art accuracy, and feature provable asymptotic guarantees.
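The abstract compresses several technical ideas; the sketch below makes two of them concrete. It is not taken from the thesis, and every function and variable name in it is hypothetical. It assumes binary labels, conditionally independent predictors, and known predictor accuracies. Under these assumptions, Bayesian aggregation of votes is exactly a random walk in posterior log-odds space, and uncertainty sampling is the adaptive policy that queries the task whose current posterior is closest to 1/2.

```python
import math
import random

def log_odds_step(accuracy: float, vote: int) -> float:
    """Log-likelihood ratio contributed by one predictor of known accuracy.
    A vote for class +1 moves the posterior log-odds up by
    log(p / (1 - p)); a vote for class -1 moves it down."""
    llr = math.log(accuracy / (1.0 - accuracy))
    return llr if vote == +1 else -llr

def aggregate(votes, accuracies) -> int:
    """Bayesian aggregation of independent votes: a random walk in
    posterior log-odds space, started at 0 (uniform prior)."""
    walk = sum(log_odds_step(p, v) for v, p in zip(votes, accuracies))
    return +1 if walk >= 0 else -1

def uncertainty_sampling(posterior_log_odds) -> int:
    """Adaptive collection policy: pick the task whose posterior is
    closest to 1/2, i.e. whose log-odds are closest to 0."""
    return min(range(len(posterior_log_odds)),
               key=lambda t: abs(posterior_log_odds[t]))

# Toy simulation: 5 tasks, a budget of 50 queries to predictors of
# accuracy 0.7, with true label +1 for every task.
random.seed(0)
tasks = [0.0] * 5          # posterior log-odds per task
p = 0.7                    # every predictor's (known) accuracy
for _ in range(50):
    t = uncertainty_sampling(tasks)
    vote = +1 if random.random() < p else -1   # noisy vote on label +1
    tasks[t] += log_odds_step(p, vote)

print([+1 if w >= 0 else -1 for w in tasks])   # aggregated decisions
```

Each vote moves the walk by a fixed log-likelihood-ratio step, which is one way to see why the error of the aggregated decision decays exponentially in the number of predictors: for k independent votes of accuracy p > 1/2, the standard Hoeffding bound already gives P(error) <= exp(-2k(p - 1/2)^2) for plain majority voting. The thesis's contribution, per the abstract, is to quantify this exponent for Bayesian aggregation and to compare it across collection policies.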
Text: PhD_thesis_with_copyright
Text: PDThesis form Manino - SIGNED (Restricted to Repository staff only)
More information
Published date: August 2020
Identifiers
Local EPrints ID: 447370
URI: http://eprints.soton.ac.uk/id/eprint/447370
PURE UUID: 8d74d5bb-ca64-4328-aa3d-7429e668e330
Catalogue record
Date deposited: 10 Mar 2021 17:37
Last modified: 16 Mar 2024 11:22
Contributors
Author: Edoardo Manino
Thesis advisor: Long Tran-Thanh