Benchmarking conventional outlier detection methods
Tiukhova, Elena, Reusens, Manon, Baesens, Bart and Snoeck, Monique (2022) Benchmarking conventional outlier detection methods. In: Guizzardi, Renata, Ralyté, Jolita and Franch, Xavier (eds.) Research Challenges in Information Science (Research Challenges in Information Science, 446), pp. 597-613. (doi:10.1007/978-3-031-05760-1_35).
Record type:
Book Section
Abstract
Nowadays, businesses in many industries face an increasing flow of data and information. Data are at the core of the decision-making process, hence it is vital to ensure that the data are of high quality and free of noise. Outlier detection methods aim to find unusual patterns in data and have applications in many practical domains. These methods employ different techniques, ranging from pure statistical tools to deep learning models that have gained popularity in recent years. Machine learning models are among the most popular outlier detection techniques, and they have several characteristics that affect their usefulness in real-life scenarios. The goal of this paper is to add to the existing body of research on outlier detection by comparing the isolation forest, DBSCAN and LOF techniques. Thus, we investigate the research question: which of these outlier detection models performs best in practical business applications? To this end, three models are built on 12 datasets and compared using 5 performance metrics. The final comparison of the models is based on McNemar's test, as well as on ranks per performance measure and on average. Three main conclusions can be drawn from the benchmarking study. First, the models considered in this research err differently, i.e. their type I and type II errors are not similar. Second, on the time, AUPRC and sensitivity metrics, the iForest model is ranked the highest. Hence, the iForest model is the best choice when time performance is a key consideration, as well as when the opportunity cost of not detecting an outlier is high. Third, the DBSCAN model obtains the highest ranking on the F1 score and precision dimensions. This allows us to conclude that the DBSCAN model is the best to employ when raising many false alarms is not an important concern.
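The paper's own code and the 12 benchmark datasets are not reproduced in this record, so the following is only a minimal sketch of the kind of pipeline the abstract describes, using scikit-learn and statsmodels on an illustrative synthetic dataset with assumed hyperparameters (contamination=0.05, eps=0.7, n_neighbors=20). It fits the three detectors, reports the five metrics named in the abstract (runtime, AUPRC, precision, sensitivity and F1), and runs McNemar's test on the paired errors of two of the models.

```python
# Minimal sketch (not the authors' code): benchmark iForest, DBSCAN and LOF on one
# labelled dataset and compare them on AUPRC, precision, sensitivity (recall), F1 and
# runtime, plus McNemar's test on the paired errors of iForest and DBSCAN.
import time
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest
from sklearn.metrics import average_precision_score, precision_recall_fscore_support
from sklearn.neighbors import LocalOutlierFactor
from statsmodels.stats.contingency_tables import mcnemar

# Illustrative data: one dense cluster plus scattered outliers (label 1 = outlier).
rng = np.random.default_rng(0)
X_in, _ = make_blobs(n_samples=950, centers=1, cluster_std=1.0, random_state=0)
X_out = rng.uniform(low=-8, high=8, size=(50, 2))
X = np.vstack([X_in, X_out])
y = np.hstack([np.zeros(950), np.ones(50)])

def evaluate(name, pred, score, seconds):
    """Report AUPRC, precision, sensitivity, F1 and runtime for one detector."""
    prec, rec, f1, _ = precision_recall_fscore_support(
        y, pred, average="binary", zero_division=0
    )
    auprc = average_precision_score(y, score)
    print(f"{name:8s} AUPRC={auprc:.3f} precision={prec:.3f} "
          f"sensitivity={rec:.3f} F1={f1:.3f} time={seconds:.3f}s")

# Isolation forest: negate score_samples so that higher means more abnormal.
t0 = time.perf_counter()
iforest = IsolationForest(contamination=0.05, random_state=0).fit(X)
pred_if = (iforest.predict(X) == -1).astype(int)
evaluate("iForest", pred_if, -iforest.score_samples(X), time.perf_counter() - t0)

# DBSCAN: points labelled -1 (noise) are treated as outliers; DBSCAN yields no
# continuous anomaly score, so the binary prediction is used for AUPRC here.
t0 = time.perf_counter()
labels = DBSCAN(eps=0.7, min_samples=10).fit_predict(X)
pred_db = (labels == -1).astype(int)
evaluate("DBSCAN", pred_db, pred_db, time.perf_counter() - t0)

# LOF: negative_outlier_factor_ is lower for outliers, so negate it as a score.
t0 = time.perf_counter()
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
pred_lof = (lof.fit_predict(X) == -1).astype(int)
evaluate("LOF", pred_lof, -lof.negative_outlier_factor_, time.perf_counter() - t0)

# McNemar's test on the paired correct/incorrect decisions of iForest vs. DBSCAN:
# a small p-value indicates the two models err on significantly different cases.
correct_if, correct_db = pred_if == y, pred_db == y
table = [[np.sum(correct_if & correct_db), np.sum(correct_if & ~correct_db)],
         [np.sum(~correct_if & correct_db), np.sum(~correct_if & ~correct_db)]]
print("McNemar p-value (iForest vs DBSCAN):", mcnemar(table, exact=False).pvalue)
```

In the study this kind of evaluation is repeated over 12 datasets, with the models ranked per metric and on average across datasets; the sketch above shows the per-dataset step only.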
Text: Benchmarking conventional outlier detection methods - Accepted Manuscript
More information
Accepted/In Press date: 18 April 2022
Published date: 14 May 2022
Additional Information:
Funding Information:
Supported by the ING Group. Acknowledgements: the research was sponsored by the ING Chair on Applying Deep Learning on Metadata as a Competitive Accelerator.
Publisher Copyright:
© 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Keywords:
DBSCAN, Local Outlier Factor, Outlier detection, iForest
Identifiers
Local EPrints ID: 458073
URI: http://eprints.soton.ac.uk/id/eprint/458073
ISSN: 1865-1348
PURE UUID: f710d81f-8b6d-4309-a989-a606e51f7daa
Catalogue record
Date deposited: 28 Jun 2022 16:35
Last modified: 17 Mar 2024 07:21
Contributors
Author:
Elena Tiukhova
Author:
Manon Reusens
Author:
Bart Baesens
Author:
Monique Snoeck
Editor:
Renata Guizzardi
Editor:
Jolita Ralyté
Editor:
Xavier Franch