University of Southampton Institutional Repository

An experimental comparison of classification algorithms for imbalanced credit scoring data sets

Brown, I. and Mues, C. (2012) An experimental comparison of classification algorithms for imbalanced credit scoring data sets. Expert Systems with Applications, 39 (3), 3446-3453. (doi:10.1016/j.eswa.2011.09.033).

Record type: Article

Abstract

In this paper, we set out to compare several techniques that can be used in the analysis of imbalanced credit scoring data sets. In a credit scoring context, imbalanced data sets frequently occur as the number of defaulting loans in a portfolio is usually much lower than the number of observations that do not default. As well as using traditional classification techniques such as logistic regression, neural networks and decision trees, this paper also explores the suitability of gradient boosting, least squares support vector machines and random forests for loan default prediction.

Five real-world credit scoring data sets are used to build classifiers and test their performance. In our experiments, we progressively increase class imbalance in each of these data sets by randomly under-sampling the minority class of defaulters, so as to identify to what extent the predictive power of the respective techniques is adversely affected. The performance criterion chosen to measure this effect is the area under the receiver operating characteristic curve (AUC); Friedman’s statistic and Nemenyi post hoc tests are used to test for significance of AUC differences between techniques.
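The experimental set-up described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the label convention (1 = defaulter, 0 = non-defaulter) and the target fraction in the usage lines are assumptions made for illustration. The first function under-samples the minority class to a chosen share of the data; the second computes the AUC as the proportion of defaulter/non-defaulter pairs in which the defaulter receives the higher score.

```python
import random


def undersample_minority(labels, minority_fraction, seed=0):
    """Return sorted row indices that keep every majority row (label 0) and a
    random subset of minority rows (label 1), so that the minority class makes
    up roughly `minority_fraction` of the result. Hypothetical helper."""
    if not 0 < minority_fraction < 1:
        raise ValueError("minority_fraction must lie strictly between 0 and 1")
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Solve k / (k + n_majority) = minority_fraction for k.
    target = round(minority_fraction * len(majority) / (1 - minority_fraction))
    target = min(target, len(minority))  # cannot keep more rows than exist
    kept = rng.sample(minority, target)
    return sorted(majority + kept)


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison: the share of
    (positive, negative) pairs where the positive scores higher; ties count
    as half. Quadratic in the sample size, which is fine for a sketch."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Usage with made-up labels: thin a 30% defaulter share down to about 10%.
labels = [1] * 30 + [0] * 70
subset = undersample_minority(labels, minority_fraction=0.10)
```

Repeating the under-sampling step with successively smaller values of `minority_fraction`, and re-evaluating the AUC of each classifier on held-out data, would give the kind of degradation curve the study examines.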

The results from this empirical study indicate that the random forest and gradient boosting classifiers perform very well in a credit scoring context and are able to cope comparatively well with pronounced class imbalances in these data sets. We also found that, when faced with a large class imbalance, the C4.5 decision tree algorithm, quadratic discriminant analysis and k-nearest neighbours perform significantly worse than the best performing classifiers.
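The significance testing mentioned in the abstract can be sketched as follows. This is an illustration under stated assumptions rather than the procedure as implemented in the paper: classifiers are ranked by AUC within each data set (rank 1 = highest AUC), the Friedman chi-square statistic is computed from the average ranks without a tie correction, and the Nemenyi critical difference takes the critical value `q_alpha` as an input that must be looked up in a published table. All function names and the AUC values in the usage lines are hypothetical.

```python
import math


def average_ranks(auc_table):
    """auc_table[i][j] is the AUC of classifier j on data set i.
    Returns the mean rank of each classifier (rank 1 = highest AUC);
    tied classifiers share the mean of the ranks they span."""
    k = len(auc_table[0])
    totals = [0.0] * k
    for row in auc_table:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / len(auc_table) for t in totals]


def friedman_statistic(auc_table):
    """Friedman chi-square statistic (no tie correction) for k classifiers
    compared over n data sets."""
    n, k = len(auc_table), len(auc_table[0])
    ranks = average_ranks(auc_table)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_critical_difference(q_alpha, k, n):
    """Two classifiers differ significantly under the Nemenyi test when their
    average ranks differ by at least this amount. q_alpha is supplied by the
    caller from a table of critical values."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Usage with made-up AUC values: rows are data sets, columns are classifiers.
example = [
    [0.80, 0.75, 0.70],
    [0.82, 0.78, 0.71],
    [0.79, 0.77, 0.69],
    [0.85, 0.80, 0.72],
]
ranks = average_ranks(example)
chi_square = friedman_statistic(example)
```

The statistic is compared against a chi-square distribution with k - 1 degrees of freedom; only if the null hypothesis of equal ranks is rejected does the pairwise Nemenyi comparison apply.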

This record has no associated files available for download.

More information

Published date: 15 February 2012
Organisations: Southampton Business School

Identifiers

Local EPrints ID: 204741
URI: http://eprints.soton.ac.uk/id/eprint/204741
ISSN: 0957-4174
PURE UUID: 98ae86e1-0564-4c27-8b92-d33b659cf697
ORCID for C. Mues: orcid.org/0000-0002-6289-5490

Catalogue record

Date deposited: 30 Nov 2011 09:58
Last modified: 15 Mar 2024 03:20

Contributors

Author: I. Brown
Author: C. Mues
