The University of Southampton
University of Southampton Institutional Repository

Resampling methods improve the predictive power of modeling in class-imbalanced datasets

Resampling methods improve the predictive power of modeling in class-imbalanced datasets
Resampling methods improve the predictive power of modeling in class-imbalanced datasets
In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and performance of common classification models on this type of dataset tends to be suboptimal. To tackle such a problem, resampling methods, including oversampling and undersampling can be used. This paper aims at illustrating the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009–2010 dataset. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model on undiagnosed diabetes. A participant demonstrated evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratio of 1:1, 1:2, and 1:4 were examined. Resampling methods on the performance of other extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded a better classification power than that on the training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.
Automated classifier, Data mining, Decision tree, Oversampling, Predictive power, Rare events
1661-7827
9776-9789
Lee, Paul H.
02620eab-ae7f-4a1c-bad1-8a50e7e48951
Lee, Paul H.
02620eab-ae7f-4a1c-bad1-8a50e7e48951

Lee, Paul H. (2014) Resampling methods improve the predictive power of modeling in class-imbalanced datasets. International Journal of Environmental Research and Public Health, 11 (9), 9776-9789. (doi:10.3390/ijerph110909776).

Record type: Article

Abstract

In the medical field, many outcome variables are dichotomized, and the two possible values of a dichotomized variable are referred to as classes. A dichotomized dataset is class-imbalanced if it consists mostly of one class, and performance of common classification models on this type of dataset tends to be suboptimal. To tackle such a problem, resampling methods, including oversampling and undersampling can be used. This paper aims at illustrating the effect of resampling methods using the National Health and Nutrition Examination Survey (NHANES) wave 2009–2010 dataset. A total of 4677 participants aged ≥20 without self-reported diabetes and with valid blood test results were analyzed. The Classification and Regression Tree (CART) procedure was used to build a classification model on undiagnosed diabetes. A participant demonstrated evidence of diabetes according to WHO diabetes criteria. Exposure variables included demographics and socio-economic status. CART models were fitted using a randomly selected 70% of the data (training dataset), and area under the receiver operating characteristic curve (AUC) was computed using the remaining 30% of the sample for evaluation (testing dataset). CART models were fitted using the training dataset, the oversampled training dataset, the weighted training dataset, and the undersampled training dataset. In addition, resampling case-to-control ratio of 1:1, 1:2, and 1:4 were examined. Resampling methods on the performance of other extensions of CART (random forests and generalized boosted trees) were also examined. CARTs fitted on the oversampled (AUC = 0.70) and undersampled training data (AUC = 0.74) yielded a better classification power than that on the training data (AUC = 0.65). Resampling could also improve the classification power of random forests and generalized boosted trees. To conclude, applying resampling methods in a class-imbalanced dataset improved the classification power of CART, random forests, and generalized boosted trees.

This record has no associated files available for download.

More information

Published date: 18 September 2014
Additional Information: Publisher Copyright: © 2014 by the authors; licensee MDPI, Basel, Switzerland.
Keywords: Automated classifier, Data mining, Decision tree, Oversampling, Predictive power, Rare events

Identifiers

Local EPrints ID: 475228
URI: http://eprints.soton.ac.uk/id/eprint/475228
ISSN: 1661-7827
PURE UUID: 32d5638e-364d-4665-be7d-1141e35aa763
ORCID for Paul H. Lee: ORCID iD orcid.org/0000-0002-5729-6450

Catalogue record

Date deposited: 14 Mar 2023 17:45
Last modified: 17 Mar 2024 04:16

Export record

Altmetrics

Contributors

Author: Paul H. Lee ORCID iD

Download statistics

Downloads from ePrints over the past year. Other digital versions may also be available to download e.g. from the publisher's website.

View more statistics

Atom RSS 1.0 RSS 2.0

Contact ePrints Soton: eprints@soton.ac.uk

ePrints Soton supports OAI 2.0 with a base URL of http://eprints.soton.ac.uk/cgi/oai2

This repository has been built using EPrints software, developed at the University of Southampton, but available to everyone to use.

We use cookies to ensure that we give you the best experience on our website. If you continue without changing your settings, we will assume that you are happy to receive cookies on the University of Southampton website.

×