The University of Southampton
University of Southampton Institutional Repository

Benchmarking conventional outlier detection methods




Tiukhova, Elena, Reusens, Manon, Baesens, Bart and Snoeck, Monique (2022) Benchmarking conventional outlier detection methods. In, Guizzardi, Renata, Ralyté, Jolita and Franch, Xavier (eds.) Research Challenges in Information Science. (Research Challenges in Information Science, 446) pp. 597-613. (doi:10.1007/978-3-031-05760-1_35).

Record type: Book Section

Abstract

Businesses in many industries face an increasing flow of data and information. Data are at the core of the decision-making process, so it is vital to ensure that the data are of high quality and free of noise. Outlier detection methods aim to find unusual patterns in data and are applied in many practical domains. They employ different techniques, ranging from purely statistical tools to deep learning models, which have gained popularity in recent years. Machine learning models are among the most popular outlier detection techniques, and they have several characteristics that affect their usefulness in real-life scenarios. The goal of this paper is to add to the existing body of research on outlier detection by comparing the isolation forest (iForest), DBSCAN and Local Outlier Factor (LOF) techniques. We investigate the research question: which of these outlier detection models performs best in practical business applications? To this end, the three models are built on 12 datasets and compared using 5 performance metrics. The final comparison of the models is based on McNemar’s test, as well as on ranks per performance measure and on average. Three main conclusions can be drawn from the benchmarking study. First, the models considered in this research disagree in different ways, i.e. their type I and type II errors are not similar. Second, on the time, AUPRC and sensitivity metrics, the iForest model is ranked the highest. Hence, the iForest model is the best choice when time performance is a key consideration, as well as when the opportunity costs of not detecting an outlier are high. Third, the DBSCAN model obtains the highest ranking on the F1 score and precision dimensions. That allows us to conclude that if raising many false alarms is not an important concern, the DBSCAN model is the best to employ.
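The comparison described in the abstract rests on per-model metrics (precision, sensitivity, F1 score) and on McNemar's test applied to pairs of models. The following is a minimal sketch of those two steps using only the Python standard library. It is not the authors' code: the labels and the two prediction vectors are hypothetical toy data, "model A" and "model B" are placeholders rather than outputs of iForest, DBSCAN or LOF, and the continuity-corrected chi-square form of McNemar's test is an assumption, since the abstract does not say which variant the paper uses.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks an outlier."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred):
    """Precision, sensitivity (recall) and F1 score for the outlier class."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mcnemar(y_true, pred_a, pred_b):
    """McNemar's test with continuity correction.

    b counts points that only model A classifies correctly,
    c counts points that only model B classifies correctly.
    Returns (b, c, statistic, p_value).
    """
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, a, p in triples if a == t and p != t)
    c = sum(1 for t, a, p in triples if a != t and p == t)
    if b + c == 0:
        return b, c, 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return b, c, stat, p_value


# Hypothetical toy data: 8 inliers (0) followed by 4 outliers (1).
y_true = [0] * 8 + [1] * 4
pred_a = [0, 0, 0, 1, 1, 1, 1, 1] + [1, 1, 1, 1]  # flags many points
pred_b = [0] * 8 + [1, 1, 1, 0]                   # flags few points

prec_a, rec_a, f1_a = precision_recall_f1(y_true, pred_a)
prec_b, rec_b, f1_b = precision_recall_f1(y_true, pred_b)
b, c, stat, p_value = mcnemar(y_true, pred_a, pred_b)

print(f"model A: precision={prec_a:.3f} sensitivity={rec_a:.3f} F1={f1_a:.3f}")
print(f"model B: precision={prec_b:.3f} sensitivity={rec_b:.3f} F1={f1_b:.3f}")
print(f"McNemar: b={b} c={c} statistic={stat:.3f} p={p_value:.3f}")
```

In this toy case model A catches every outlier at the cost of false alarms, while model B raises no false alarms but misses one outlier, which mirrors the kind of trade-off between sensitivity and precision that the abstract reports. With so few discordant pairs the chi-square approximation is rough, and an exact binomial version of the test would usually be preferred.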


More information

Accepted/In Press date: 18 April 2022
Published date: 14 May 2022
Additional Information: Funding Information: Supported by the ING Group. Acknowledgements: The research was sponsored by the ING Chair on Applying Deep Learning on Metadata as a Competitive Accelerator. Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Keywords: DBSCAN, Local Outlier Factor, Outlier detection, iForest

Identifiers

Local EPrints ID: 458073
URI: http://eprints.soton.ac.uk/id/eprint/458073
ISSN: 1865-1348
PURE UUID: f710d81f-8b6d-4309-a989-a606e51f7daa
ORCID for Bart Baesens: orcid.org/0000-0002-5831-5668

Catalogue record

Date deposited: 28 Jun 2022 16:35
Last modified: 17 Mar 2024 07:21

Contributors

Author: Elena Tiukhova
Author: Manon Reusens
Author: Bart Baesens
Author: Monique Snoeck
Editor: Renata Guizzardi
Editor: Jolita Ralyté
Editor: Xavier Franch
