Benchmarking conventional outlier detection methods

Businesses in many industries face an increasing flow of data and information. Data are at the core of the decision-making process, so it is vital to ensure that the data are of high quality and free of noise. Outlier detection methods aim to find unusual patterns in data and are applied in many practical domains. These methods employ different techniques, ranging from purely statistical tools to deep learning models, which have gained popularity in recent years. Machine learning models are among the most popular outlier detection techniques, and they have several characteristics that affect their usefulness in real-life scenarios. The goal of this paper is to add to the existing body of research on outlier detection by comparing the isolation forest (iForest), DBSCAN and Local Outlier Factor (LOF) techniques. We investigate the research question: which of these outlier detection models performs best in practical business applications? To this end, the three models are built on 12 datasets and compared using 5 performance metrics. The final comparison of the models is based on McNemar's test, as well as on ranks per performance measure and on average. Three main conclusions can be drawn from the benchmarking study. First, the models considered in this research disagree in different ways, i.e. their type I and type II errors are not similar. Second, on the time, AUPRC and sensitivity metrics, the iForest model is ranked the highest. Hence, the iForest model is the best choice when time performance is a key consideration and when the opportunity costs of not detecting an outlier are high. Third, the DBSCAN model obtains the highest ranking on the F1 score and precision dimensions. That allows us to conclude that if raising many false alarms is not an important concern, the DBSCAN model is the best to employ.
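For readers unfamiliar with McNemar's test, which the abstract names as the basis for the final pairwise comparison, the following is a minimal sketch of the continuity-corrected form of the statistic. It is an illustration only and not the authors' code: the function name `mcnemar`, the label encoding and the toy data are assumptions, and the record does not say which variant of the test the paper uses.

```python
import math

def mcnemar(y_true, pred_a, pred_b):
    """Continuity-corrected McNemar's test for two models scored on the same points.

    Hypothetical helper for illustration; returns (statistic, p_value).
    """
    rows = list(zip(y_true, pred_a, pred_b))
    # b: model A is right and model B is wrong; c: the reverse.
    b = sum(1 for t, a, m in rows if a == t and m != t)
    c = sum(1 for t, a, m in rows if a != t and m == t)
    if b + c == 0:
        # The models never disagree on correctness, so there is nothing to test.
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy data (assumed): 1 marks an outlier, 0 an inlier.
y_true = [1] * 10
pred_a = [1] * 8 + [0] * 2   # model A misses the last two outliers
pred_b = [0] * 8 + [1] * 2   # model B misses the first eight outliers
stat, p_value = mcnemar(y_true, pred_a, pred_b)
```

Only the points on which exactly one of the two models is correct enter the statistic, which is why the test speaks to whether two models make different errors rather than to their overall accuracy.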

Keywords: DBSCAN, Local Outlier Factor, Outlier detection, iForest

DOI: 10.1007/978-3-031-05760-1_35

ISSN: 1865-1348

Pages: 597-613

Tiukhova, Elena

Reusens, Manon

Baesens, Bart

Snoeck, Monique

Guizzardi, Renata

Ralyté, Jolita

Franch, Xavier

14 May 2022

Tiukhova, Elena, Reusens, Manon, Baesens, Bart and Snoeck, Monique (2022) Benchmarking conventional outlier detection methods. In Guizzardi, Renata, Ralyté, Jolita and Franch, Xavier (eds.) Research Challenges in Information Science. (Research Challenges in Information Science, 446) pp. 597-613. (doi:10.1007/978-3-031-05760-1_35).

Record type: Book Section

Benchmarking conventional outlier detection methods - Accepted Manuscript

Available under License University of Southampton Accepted Manuscript Licence.

More information

Accepted/In Press date: 18 April 2022

Published date: 14 May 2022

Additional Information: Funding Information: Supported by the ING Group. Acknowledgements: The research was sponsored by the ING Chair on Applying Deep Learning on Metadata as a Competitive Accelerator. Publisher Copyright: © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
