University of Southampton Institutional Repository

Comprehensive review of classification algorithms for high dimensional datasets

Syarif, Iwan (2014) Comprehensive review of classification algorithms for high dimensional datasets. University of Southampton, Physical Sciences and Engineering, Doctoral Thesis, 114pp.

Record type: Thesis (Doctoral)

Abstract

Machine learning algorithms have been widely used to solve many kinds of data classification problem. Classification problems, especially for high-dimensional datasets, have attracted many researchers seeking efficient approaches to address them. However, classification becomes very complicated and computationally expensive when the number of possible combinations of variables is high. In this research, we evaluate the performance of four basic classifiers (naïve Bayes, k-nearest neighbour, decision tree and rule induction), ensemble classifiers (bagging and boosting) and the Support Vector Machine (SVM). We also investigate two widely used feature selection algorithms: the Genetic Algorithm (GA) and Particle Swarm Optimization (PSO).
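The GA wrapper approach to feature selection described above evolves binary masks over the feature set, scoring each mask with a classifier-based fitness. The sketch below is an illustrative toy, not the thesis's implementation: the operators (tournament selection, one-point crossover, bit-flip mutation), all parameter values and the `toy_fitness` function are assumptions chosen for a self-contained demonstration.

```python
import random

def ga_feature_selection(fitness, n_features, pop_size=20, generations=40,
                         crossover_rate=0.9, mutation_rate=0.02, seed=0):
    """Evolve a binary feature mask that maximises fitness(mask)."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    best = max(pop, key=fitness)

    for _ in range(generations):
        def select():
            # Tournament selection: keep the fitter of two random individuals.
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b

        nxt = []
        while len(nxt) < pop_size:
            p1, p2 = select(), select()
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)   # one-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation on each gene with small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            nxt.append(child)
        pop = nxt
        gen_best = max(pop, key=fitness)
        if fitness(gen_best) > fitness(best):     # elitism: never lose the best mask
            best = gen_best
    return best

# Toy fitness standing in for a wrapper's cross-validated accuracy:
# reward the first 3 (informative) features, penalise every selected feature.
def toy_fitness(mask):
    return sum(mask[:3]) - 0.1 * sum(mask)

mask = ga_feature_selection(toy_fitness, n_features=10)
print("selected features:", [i for i, g in enumerate(mask) if g])
```

In a real wrapper setting, `toy_fitness` would be replaced by the cross-validated accuracy of one of the base classifiers trained on the masked feature subset, possibly with a penalty term on subset size.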

Our experiments show that feature selection algorithms, in particular GA and PSO, significantly reduce the number of features needed as well as greatly reducing the computational cost. Furthermore, these algorithms do not severely reduce the classification accuracy, and in some cases they improve it. On average across the 9 datasets, PSO reduces the attributes to 12.78% of the original set, while GA reduces them only to 30.52%. In terms of classification performance, GA is better than PSO: the datasets reduced by GA classify better than their originals on 5 of 9 datasets, whereas the datasets reduced by PSO improve on only 3 of 9. The total running time of the four basic classifiers (NB, kNN, DT and RI) on the 9 original datasets is 68,169 seconds, compared with 3,799 seconds on the GA-reduced datasets and only 326 seconds on the PSO-reduced datasets (more than 209 times faster).
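The PSO variant used for feature selection operates on binary positions: each velocity component is squashed through a sigmoid to give the probability that a feature bit is set. The following is a minimal sketch of that scheme; the inertia and acceleration coefficients, swarm size and `toy_fitness` are all assumed values for illustration, not the thesis's settings.

```python
import math
import random

def binary_pso(fitness, n_features, n_particles=15, iterations=30, seed=0):
    """Binary PSO: sigmoid of each velocity component gives the
    probability that the corresponding feature bit is 1."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_particles)]
    vel = [[0.0] * n_features for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                 # each particle's best mask so far
    gbest = max(pbest, key=fitness)[:]          # swarm-wide best mask
    w, c1, c2 = 0.7, 1.5, 1.5                   # inertia and acceleration coefficients

    for _ in range(iterations):
        for i in range(n_particles):
            for d in range(n_features):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Sample the bit: sigmoid(velocity) is the probability of a 1.
                pos[i][d] = 1 if rng.random() < 1 / (1 + math.exp(-vel[i][d])) else 0
            if fitness(pos[i]) > fitness(pbest[i]):
                pbest[i] = pos[i][:]
        gbest = max(pbest + [gbest], key=fitness)[:]
    return gbest

# Same toy stand-in for wrapper fitness as before: 3 informative features,
# small penalty per selected feature.
def toy_fitness(mask):
    return sum(mask[:3]) - 0.1 * sum(mask)

mask = binary_pso(toy_fitness, n_features=10)
print("selected features:", [i for i, g in enumerate(mask) if g])
```

Because each bit is resampled from a probability rather than crossed over and mutated, PSO tends to drive uninformative bits toward zero quickly, which is consistent with the more aggressive attribute reduction reported above.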

We applied ensemble classifiers, bagging and boosting, as a comparison. Our experiments show that bagging and boosting do not give a significant improvement: the average improvement of bagging across the nine datasets is only 0.85%, while that of boosting is 1.14%, although the ensemble classifiers (both bagging and boosting) outperform the single classifier on 6 of 9 datasets. SVM performs much better when dealing with high-dimensional datasets and numerical features. Although SVM works well with its default parameter values, its performance can be improved significantly by parameter optimization. Our experiments show that SVM parameter optimization using grid search always finds a near-optimal parameter combination within the given ranges, and it is able to improve the accuracy significantly. Unfortunately, grid search is very slow, so it is practical only on low-dimensional datasets with few parameters. SVM parameter optimization using an Evolutionary Algorithm (EA) avoids this problem and has proven more stable than grid search. Based on average running time, EA is almost 16 times faster than grid search (294 seconds compared to 4,680 seconds). Overall, SVM with parameter optimization outperforms the other algorithms on 5 of 9 datasets. However, SVM does not perform well on datasets that have non-numerical attributes.
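Grid search for an RBF SVM of the kind discussed above exhaustively evaluates every combination on an exponential grid over the penalty C and kernel width gamma, which is why its cost grows so quickly with grid size. As an illustrative sketch only, not the thesis's own tooling or datasets, this can be reproduced with scikit-learn (the dataset and grid ranges below are assumptions):

```python
# Sketch of exhaustive C/gamma grid search for an RBF SVM using scikit-learn.
# load_iris is a small stand-in dataset, not one of the thesis's nine.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Conventional exponential grid: every (C, gamma) pair is cross-validated,
# so the number of model fits is |C grid| x |gamma grid| x folds.
grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [2.0**k for k in range(-1, 4)],
                "gamma": [2.0**k for k in range(-4, 1)]},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

An evolutionary alternative would replace the exhaustive sweep with a population of (C, gamma) candidates evolved toward higher cross-validated accuracy, evaluating far fewer combinations, which matches the roughly 16x speed-up reported above.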

PDF
__userfiles.soton.ac.uk_Users_slb1_mydesktop___soton.ac.uk_ude_personalfiles_users_jo1d13_mydesktop_Syarif.pdf - Other
Download (1MB)

More information

Published date: March 2014
Keywords: high dimensional data, feature selection, ensemble classifiers, support vector machine, evolutionary algorithms, parameter optimization
Organisations: Web & Internet Science

Identifiers

Local EPrints ID: 379927
URI: https://eprints.soton.ac.uk/id/eprint/379927
PURE UUID: a90af8e3-6237-4363-9449-b90b99ad313b

Catalogue record

Date deposited: 17 Aug 2015 13:00
Last modified: 17 Jul 2017 20:39

Contributors

Author: Iwan Syarif
Thesis advisor: Adam Prugel-Bennett
