Comparison and performance enhancement of modern pattern classifiers
Suppharangsan, Somjet (2010) Comparison and performance enhancement of modern pattern classifiers. University of Southampton, School of Electronics and Computer Science, Doctoral Thesis, 144pp.
Record type: Thesis (Doctoral)
Abstract
This thesis is a critical empirical study, using a range of benchmark datasets, of the performance of some modern machine learning systems and of possible enhancements to them. When new algorithms and their performance are reported in the machine learning literature, most authors pay little attention to the statistical significance of performance differences. We take Gaussian process classifiers as an example, for which the literature shows a disappointingly small number of performance evaluations. What is particularly ignored is any use of the uncertainties in the performance measures when making comparisons. This thesis makes a novel contribution by developing a methodology for formal comparisons that also incorporates performance uncertainties. Using the support vector machine (SVM) as the classification architecture, the thesis explores two potential enhancements aimed at reducing complexity: (a) subset selection on the training data through pre-processing, and (b) organising the classes of a multi-class problem in a tree structure for fast classification. The former is crucial because dataset sizes have grown rapidly, and straightforward training by quadratic programming over all of the given data is prohibitively expensive. While some researchers focus on training algorithms that operate in a stochastic manner, we explore data reduction by cluster analysis. Multi-class problems in which the number of classes is very large are also of increasing interest. Our contribution is to speed up training by removing as much irrelevant data as possible while preserving the points that are likely to become support vectors. The results show that too high a data reduction rate can degrade performance; however, on a subset of problems, the proposed methods produce results comparable to the full SVM despite the high reduction rate. The new learning tree structure can then be combined with the data selection methods to obtain a further increase in speed. Finally, we critically review SVM classification problems in which the input data are binary. In the chemoinformatics and bioinformatics literature, the Tanimoto kernel has been reported empirically to perform well. The work we present, using carefully constructed synthetic data of varying dimensionality and dataset size, casts doubt on such claims: improvements are noticeable, but not to the extent claimed in previous studies.
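The Tanimoto kernel mentioned above has a standard closed form on binary fingerprint vectors, K(x, y) = <x, y> / (<x, x> + <y, y> - <x, y>). The sketch below is a minimal illustration of that kernel, not code from the thesis; it assumes NumPy, and the commented usage example assumes scikit-learn's precomputed-kernel SVM interface.

import numpy as np

def tanimoto_kernel(X, Y):
    """Tanimoto (Jaccard) kernel for binary feature vectors.

    K(x, y) = <x, y> / (<x, x> + <y, y> - <x, y>)

    X is (n, d) and Y is (m, d) with 0/1 entries; returns the (n, m) Gram matrix.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    inner = X @ Y.T                        # pairwise <x, y>
    x_sq = (X * X).sum(axis=1)[:, None]    # <x, x> for each row of X
    y_sq = (Y * Y).sum(axis=1)[None, :]    # <y, y> for each row of Y
    denom = x_sq + y_sq - inner
    # A pair of all-zero fingerprints gives a zero denominator; define K = 0 there.
    return np.divide(inner, denom, out=np.zeros_like(inner), where=denom > 0)

# Hypothetical usage with a precomputed-kernel SVM (assumes scikit-learn is available):
# from sklearn.svm import SVC
# clf = SVC(kernel='precomputed').fit(tanimoto_kernel(X_train, X_train), y_train)
# y_pred = clf.predict(tanimoto_kernel(X_test, X_train))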
Text: S.Suppharangsan-PhDThesis2010.pdf - Other
More information
Published date: November 2010
Organisations:
University of Southampton
Identifiers
Local EPrints ID: 170393
URI: http://eprints.soton.ac.uk/id/eprint/170393
PURE UUID: 668e4a4c-79c1-4167-8f67-803a42de6d97
Catalogue record
Date deposited: 18 Jan 2011 16:41
Last modified: 14 Mar 2024 02:53
Contributors
Author: Somjet Suppharangsan
Thesis advisor: M. Niranjan