University of Southampton Institutional Repository

Comprehensive review of classification algorithms for high dimensional datasets

Syarif, Iwan (2014) Comprehensive review of classification algorithms for high dimensional datasets. University of Southampton, Physical Sciences and Engineering, Doctoral Thesis, 114pp.

Record type: Thesis (Doctoral)

Abstract

Machine learning algorithms have been widely used to solve many kinds of data classification problem. Classification problems, especially for high-dimensional datasets, have attracted many researchers seeking efficient approaches to address them. However, classification becomes very complicated and computationally expensive when the number of possible combinations of variables is high. In this research, we evaluate the performance of four basic classifiers (naïve Bayes, k-nearest neighbour, decision tree and rule induction), ensemble classifiers (bagging and boosting) and the Support Vector Machine (SVM). We also investigate two widely used feature selection algorithms: the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO).
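The GA wrapper approach to feature selection described above evolves binary masks over the feature set, scoring each mask with a classifier-based fitness. The sketch below is an illustrative toy, not the thesis's implementation: the operators (tournament selection, one-point crossover, bit-flip mutation), all parameter values and the `toy_fitness` function are assumptions chosen for a self-contained demonstration.

```python
import random

def ga_feature_selection(fitness, n_features, pop_size=20, generations=40,
                         crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve a binary feature mask that maximises fitness(mask)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    for _ in range(generations):
        def select():
            # Tournament selection: keep the fitter of two random individuals.
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b

        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)   # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation on each gene with small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            nxt.append(child)
        pop = nxt
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):     # elitism: never lose the best mask
            best = gen_best
    return best

# Toy fitness standing in for a wrapper's cross-validated accuracy:
# reward the first 3 (informative) features, penalise every selected feature.
def toy_fitness(mask):
    return sum(mask[:3]) - 0.1 * sum(mask)

mask = ga_feature_selection(toy_fitness, n_features=10)
print("selected features:", [i for i, g in enumerate(mask) if g])
```

In a real wrapper setting, `toy_fitness` would be replaced by the cross-validated accuracy of one of the base classifiers trained on the masked feature subset, possibly with a penalty term on subset size.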

Our experiments show that feature selection algorithms, in particular GA and PSO, significantly reduce the number of features needed as well as greatly reducing the computational cost. Furthermore, these algorithms do not severely reduce the classification accuracy, and in some cases they improve it. On average across the 9 datasets, PSO reduces the attributes to 12.78% of the original set, while GA reduces them only to 30.52%. In terms of classification performance, GA is better than PSO: the datasets reduced by GA classify better than their originals on 5 of 9 datasets, whereas the datasets reduced by PSO improve on only 3 of 9. The total running time of the four basic classifiers (NB, kNN, DT and RI) on the 9 original datasets is 68,169 seconds, compared with 3,799 seconds on the GA-reduced datasets and only 326 seconds on the PSO-reduced datasets (more than 209 times faster).
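The PSO variant used for feature selection operates on binary positions: each velocity component is squashed through a sigmoid to give the probability that a feature bit is set. The following is a minimal sketch of that scheme; the inertia and acceleration coefficients, swarm size and `toy_fitness` are all assumed values for illustration, not the thesis's settings.

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=15, iterations=30, seed=0):
    """Binary PSO: sigmoid of each velocity component gives the
    probability that the corresponding feature bit is 1."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best mask so far
    gbest = max(pbest, key=fitness)[:]          # swarm-wide best mask
    w, c1, c2 = 0.7, 1.5, 1.5                   # inertia and acceleration coefficients

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(n_features):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Sample the bit: sigmoid(velocity) is the probability of a 1.
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = max(pbest + [gbest], key=fitness)[:]
    return gbest

# Same toy stand-in for wrapper fitness as before: 3 informative features,
# small penalty per selected feature.
def toy_fitness(mask):
    return sum(mask[:3]) - 0.1 * sum(mask)

mask = binary_pso(toy_fitness, n_features=10)
print("selected features:", [i for i, g in enumerate(mask) if g])
```

Because each bit is resampled from a probability rather than crossed over and mutated, PSO tends to drive uninformative bits toward zero quickly, which is consistent with the more aggressive attribute reduction reported above.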

We applied ensemble classifiers, bagging and boosting, as a comparison. Our experiments show that bagging and boosting do not give a significant improvement: the average improvement of bagging across the nine datasets is only 0.85%, while that of boosting is 1.14%, although the ensemble classifiers (both bagging and boosting) outperform the single classifier on 6 of 9 datasets. SVM performs much better when dealing with high-dimensional datasets and numerical features. Although SVM works well with its default parameter values, its performance can be improved significantly by parameter optimization. Our experiments show that SVM parameter optimization using grid search always finds a near-optimal parameter combination within the given ranges, and it is able to improve the accuracy significantly. Unfortunately, grid search is very slow, so it is practical only on low-dimensional datasets with few parameters. SVM parameter optimization using an Evolutionary Algorithm (EA) avoids this problem and has proven more stable than grid search. Based on average running time, EA is almost 16 times faster than grid search (294 seconds compared to 4,680 seconds). Overall, SVM with parameter optimization outperforms the other algorithms on 5 of 9 datasets. However, SVM does not perform well on datasets that have non-numerical attributes.
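Grid search for an RBF SVM of the kind discussed above exhaustively evaluates every combination on an exponential grid over the penalty C and kernel width gamma, which is why its cost grows so quickly with grid size. As an illustrative sketch only, not the thesis's own tooling or datasets, this can be reproduced with scikit-learn (the dataset and grid ranges below are assumptions):

```python
# Sketch of exhaustive C/gamma grid search for an RBF SVM using scikit-learn.
# load_iris is a small stand-in dataset, not one of the thesis's nine.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Conventional exponential grid: every (C, gamma) pair is cross-validated,
# so the number of model fits is |C grid| x |gamma grid| x folds.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [2.0**k for k in range(-1, 4)],
                "gamma": [2.0**k for k in range(-4, 1)]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

An evolutionary alternative would replace the exhaustive sweep with a population of (C, gamma) candidates evolved toward higher cross-validated accuracy, evaluating far fewer combinations, which matches the roughly 16x speed-up reported above.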

PDF
__userfiles.soton.ac.uk_Users_slb1_mydesktop___soton.ac.uk_ude_personalfiles_users_jo1d13_mydesktop_Syarif.pdf - Other
Download (1MB)

More information

Published date: March 2014
Keywords: high dimensional data, feature selection, ensemble classifiers, support vector machine, evolutionary algorithms, parameter optimization
Organisations: Web & Internet Science

Identifiers

Local EPrints ID: 379927
URI: https://eprints.soton.ac.uk/id/eprint/379927
PURE UUID: a90af8e3-6237-4363-9449-b90b99ad313b

Catalogue record

Date deposited: 17 Aug 2015 13:00
Last modified: 17 Jul 2017 20:39

Contributors

Author: Iwan Syarif
Thesis advisor: Adam Prugel-Bennett
