One-pass algorithms for large and shifting data sets
University of Southampton, School of Electronics and Computer Science
For many problem domains, practitioners face ever-increasing amounts of data. Examples include the UniProt database of proteins, which now contains ~6 million sequences, and the KDD ’99 data, which consists of ~5 million points. At these scales, state-of-the-art machine learning techniques are not applicable, since the multiple passes they require through the data are prohibitively expensive, and different approaches are needed. Another issue arising in real-world tasks, which has only recently become a topic of interest in the machine learning community, is distribution shift, which occurs naturally in many problem domains such as intrusion detection and EEG signal mapping in the Brain-Computer Interface domain. Under distribution shift, the assumption that the training and test data are drawn i.i.d. from the same distribution does not hold, causing classifiers to perform poorly on the unseen test set.
We first present a novel, hierarchical, one-pass clustering technique that is capable of handling very large data sets. Our experiments show that the quality of the clusters generated by our method does not degrade, while the method makes vast computational savings compared to algorithms that require multiple passes through the data. We then propose Voted Spheres (VS), a novel, non-linear, one-pass, multi-class classification technique capable of handling millions of points in minutes. Our empirical study shows that it achieves state-of-the-art performance on real-world data sets in a fraction of the time required by other methods. We then adapt VS to deal with covariate shift between the training and test phases using two different techniques: an importance weighting scheme and kernel mean matching. Our results on a toy problem and the real-world KDD ’99 data show that these adaptations improve the performance of the VS framework. Our final contribution applies the one-pass VS algorithm, along with its covariate-shift-adapted counterpart, to the Brain-Computer Interface (BCI) domain, in which linear batch algorithms are generally used. Our VS-based methods outperform the SVM and perform very competitively with the submissions to a recent BCI competition, which further demonstrates the robustness of our proposed techniques across different problem domains.