An experimental comparison of classification algorithms for imbalanced credit scoring data sets
3446-3453
Brown, I.
Mues, C.
15 February 2012
Brown, I. and Mues, C.
(2012)
An experimental comparison of classification algorithms for imbalanced credit scoring data sets.
Expert Systems with Applications, 39 (3), 3446-3453.
(doi:10.1016/j.eswa.2011.09.033).
Abstract
In this paper, we compare several techniques that can be used to analyse imbalanced credit scoring data sets. In credit scoring, imbalanced data sets occur frequently because the number of defaulting loans in a portfolio is usually much lower than the number of loans that do not default. As well as traditional classification techniques such as logistic regression, neural networks and decision trees, we explore the suitability of gradient boosting, least squares support vector machines and random forests for loan default prediction.
Five real-world credit scoring data sets are used to build classifiers and test their performance. In our experiments, we progressively increase the class imbalance in each of these data sets by randomly under-sampling the minority class of defaulters, so as to identify the extent to which the predictive power of each technique is adversely affected. The performance criterion used to measure this effect is the area under the receiver operating characteristic curve (AUC); Friedman’s statistic and Nemenyi post hoc tests are used to test the significance of AUC differences between techniques.
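The under-sampling and AUC steps described in this paragraph can be illustrated with a minimal sketch. It is not the authors' code: the function names `undersample_minority` and `auc`, the `target_rate` parameter, and the toy labels are all assumptions made for illustration, and the record does not state which imbalance levels the paper used.

```python
import random


def undersample_minority(labels, target_rate, seed=0):
    """Keep every non-defaulter (0) and a random subset of defaulters (1)
    so that defaulters make up roughly `target_rate` of the kept rows.
    Returns the sorted indices of the kept rows."""
    rng = random.Random(seed)
    good = [i for i, y in enumerate(labels) if y == 0]
    bad = [i for i, y in enumerate(labels) if y == 1]
    # Solve n_bad / (n_bad + n_good) = target_rate for n_bad.
    n_bad = round(target_rate * len(good) / (1.0 - target_rate))
    n_bad = max(1, min(len(bad), n_bad))
    return sorted(good + rng.sample(bad, n_bad))


def auc(labels, scores):
    """AUC as the probability that a randomly chosen defaulter receives a
    higher score than a randomly chosen non-defaulter (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy portfolio (invented): 70 non-defaulters, 30 defaulters.
labels = [0] * 70 + [1] * 30
kept = undersample_minority(labels, target_rate=0.10)
reduced = [labels[i] for i in kept]
print(sum(reduced), "defaulters out of", len(reduced), "loans")
```

Repeating the call with smaller values of `target_rate` gives progressively more imbalanced training samples, on which a classifier could be refitted and its AUC recomputed on held-out data.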
The results of this empirical study indicate that the random forest and gradient boosting classifiers perform very well in a credit scoring context and cope comparatively well with pronounced class imbalances in these data sets. We also found that, when faced with a large class imbalance, the C4.5 decision tree algorithm, quadratic discriminant analysis and k-nearest neighbours perform significantly worse than the best-performing classifiers.
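To make the significance testing mentioned in the abstract concrete, the sketch below computes average ranks and the Friedman chi-square statistic in their standard textbook form, plus the Nemenyi critical difference. The function names and the AUC table are invented for illustration and are not results from the paper; `q_alpha` is an assumed input that has to be looked up in a studentized range table for the chosen significance level and number of classifiers.

```python
import math


def average_ranks(auc_table):
    """auc_table[i][j] is the AUC of classifier j on data set i.
    Rank 1 goes to the highest AUC on each data set; ties share the
    average of the ranks they span."""
    k = len(auc_table[0])
    totals = [0.0] * k
    for row in auc_table:
        for j, value in enumerate(row):
            higher = sum(1 for v in row if v > value)
            tied = sum(1 for v in row if v == value)  # includes itself
            totals[j] += higher + (tied + 1) / 2.0
    return [t / len(auc_table) for t in totals]


def friedman_statistic(avg_ranks, n_datasets):
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return (12.0 * n_datasets / (k * (k + 1))
            * (sum_sq - k * (k + 1) ** 2 / 4.0))


def nemenyi_critical_difference(k, n_datasets, q_alpha):
    """Two classifiers differ significantly if their average ranks
    differ by at least this amount. q_alpha comes from a table."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n_datasets))


# Invented AUC values: 5 data sets (rows) by 3 hypothetical classifiers.
toy_aucs = [
    [0.80, 0.75, 0.70],
    [0.78, 0.74, 0.69],
    [0.82, 0.79, 0.71],
    [0.77, 0.73, 0.68],
    [0.81, 0.76, 0.72],
]
ranks = average_ranks(toy_aucs)
stat = friedman_statistic(ranks, n_datasets=len(toy_aucs))
print(ranks, stat)
```

If the Friedman statistic rejects the hypothesis that all classifiers rank equally, the Nemenyi critical difference gives the gap in average rank that a pairwise comparison must exceed.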
More information
Published date: 15 February 2012
Organisations:
Southampton Business School
Identifiers
Local EPrints ID: 204741
URI: http://eprints.soton.ac.uk/id/eprint/204741
ISSN: 0957-4174
PURE UUID: 98ae86e1-0564-4c27-8b92-d33b659cf697
Catalogue record
Date deposited: 30 Nov 2011 09:58
Last modified: 15 Mar 2024 03:20
Contributors
Author:
I. Brown