One-pass algorithms for large and shifting data sets

Farran, Bassam (2010) One-pass algorithms for large and shifting data sets. University of Southampton, School of Electronics and Computer Science, Doctoral Thesis, 144pp.

Record type: Thesis (Doctoral)

Abstract

In many problem domains, practitioners face ever-increasing amounts of data. Examples include the UniProt database of proteins, which now contains ~6 million sequences, and the KDD ’99 data, which consists of ~5 million points. At these scales, state-of-the-art machine learning techniques are not applicable, since the multiple passes they require through the data are prohibitively expensive, and a need for different approaches arises. Another issue arising in real-world tasks, which has only recently become a topic of interest in the machine learning community, is distribution shift, which occurs naturally in many problem domains such as intrusion detection and EEG signal mapping in the Brain-Computer Interface (BCI) domain. Under distribution shift, the assumption that the training and test data are i.i.d. samples from the same distribution does not hold, causing classifiers to perform poorly on the unseen test set.

We first present a novel, hierarchical, one-pass clustering technique that is capable of handling very large data. Our experiments show that the quality of the clusters generated by our method does not degrade, while the method achieves vast computational savings compared to algorithms that require multiple passes through the data. We then propose Voted Spheres (VS), a novel, non-linear, one-pass, multi-class classification technique capable of handling millions of points in minutes. Our empirical study shows that it achieves state-of-the-art performance on real-world data sets in a fraction of the time required by other methods. We then adapt VS to deal with covariate shift between the training and test phases using two different techniques: an importance weighting scheme and kernel mean matching. Our results on a toy problem and the real-world KDD ’99 data show that these adaptations improve the performance of our VS framework. Our final contribution involves applying the one-pass VS algorithm, along with its covariate-shift-adapted counterpart, to the Brain-Computer Interface domain, in which linear batch algorithms are generally used. Our VS-based methods outperform the SVM and perform very competitively with the submissions to a recent BCI competition, which further shows the robustness of our proposed techniques across different problem domains.

Text

Thesis.pdf - Other


More information

Published date: June 2010

Organisations: University of Southampton

Identifiers

Local EPrints ID: 159173

URI: http://eprints.soton.ac.uk/id/eprint/159173

PURE UUID: 98c43df4-753c-475d-af4c-10d4cd6372f2

ORCID for Mahesan Niranjan:

orcid.org/0000-0001-7021-140X

Catalogue record

Date deposited: 16 Jul 2010 11:51

Last modified: 14 Mar 2024 02:53


Contributors

Author: Bassam Farran

Thesis advisor: Mahesan Niranjan
