The University of Southampton
University of Southampton Institutional Repository

Outlier and Anomaly Detection Methods with Applications to the 2021 Census


Sabeur, Zoheir, Correndo, Gianluca, Veres, Galina V, Smith, Paul A. and Dawber, James (2021) Outlier and Anomaly Detection Methods with Applications to the 2021 Census. 28pp.

Record type: Monograph (Project Report)

Abstract

The Office for National Statistics (ONS) contracted the University of Southampton to conduct research into the use of statistical and data science methods for the automatic detection of outliers and anomalies in Census data. The project considered both Census 2011, which was based mostly on traditional survey methods, and Census 2021, which was conducted mostly through online surveys. The ONS has since given us permission to publish the findings of this project. This information is licensed under the Open Government Licence v3.0. To view this licence, visit http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/

This research project was carried out in close collaboration with ONS and ran in two phases:

Phase 1: Literature review and selection of methods for detection of outliers and anomalies in Census 2021 data
Phase 2: Prototype demonstrator of outlier and anomaly detection in Census 2021 data

This document is the final report of Phase 2, concerning the testing of the statistical and data science methods selected during Phase 1 for the detection of outliers and anomalies in Census 2021 data. These methods were investigated using synthetic data perturbations to simulate anomalies that are likely to occur in the census data. The data perturbation strategies were decided in consultation with experts at ONS.

The second phase of the project was conducted with the following procedures, in consultation with ONS:

• The experimental data for this study were obtained from the “2011 Census Microdata LA data”, which were made available by the UK Data Service.
• The experimental 2011 census data had already been processed and cleaned, so no anomalies were expected to be present. To assess our anomaly detection methods on the census data, anomalies therefore had to be added synthetically.
• Several discussions with ONS experts shaped our strategy for synthetically adding anomalies to the data. Data perturbations were performed in accordance with real errors that occurred in the previous Census.
• A significant number of the selected candidate methods (both statistical and data science-based) for the detection of outliers and anomalies were investigated using the newly perturbed 2011 Census data, and their respective performances in detecting census data anomalies were obtained.
• Benchmarking of the Spark implementations of the selected outlier detection methods was performed. This early testbed experiment revealed scalability trends over increasing volumes and complexities of census records.
• The various outlier detection scripts were integrated into the Jupyter environment as the first prototype demonstrator for ONS.
• Three major research programmes have been identified for future studies: Methods for Census Data Perturbation; Outlier and Anomaly Detection Methods and Machine Learning Strategies; and Methods Scalability using Spark Technology. Each of these topics is discussed in Section 6, with recommendations for future work.
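
The evaluation approach described in the abstract (start from clean data, inject synthetic anomalies, run a detector, and score it against the known injected positions) can be illustrated with a small sketch. This is not the report's implementation: the abstract does not name the detection methods or the perturbation strategies, so the error model (implausibly large ages), the detector (a modified z-score based on the median and median absolute deviation), and all function names below are assumptions chosen purely for illustration.

```python
import random
import statistics


def perturb(values, rate, rng):
    """Return a perturbed copy of `values` and the set of corrupted indices.

    Hypothetical error model: a fraction `rate` of records is replaced by an
    implausibly large age, loosely mimicking a keying error.
    """
    perturbed = list(values)
    n_bad = max(1, round(rate * len(values)))
    bad_idx = set(rng.sample(range(len(values)), n_bad))
    for i in bad_idx:
        perturbed[i] = rng.randint(300, 999)
    return perturbed, bad_idx


def mad_outliers(values, threshold=3.5):
    """Flag indices whose modified z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return set()  # no spread: the score is undefined, so flag nothing
    return {
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    }


def precision_recall(flagged, truth):
    """Score flagged indices against the indices that were actually perturbed."""
    true_positives = len(flagged & truth)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


rng = random.Random(0)  # fixed seed so the run is repeatable
clean_ages = [rng.randint(0, 90) for _ in range(1000)]  # stand-in for clean data
dirty_ages, injected = perturb(clean_ages, rate=0.02, rng=rng)
flagged = mad_outliers(dirty_ages)
precision, recall = precision_recall(flagged, injected)
print(f"injected={len(injected)} flagged={len(flagged)} "
      f"precision={precision:.2f} recall={recall:.2f}")
```

The injected errors here are deliberately extreme, so a simple univariate rule separates them easily. Perturbations modelled on real census errors would generally be subtler and often involve combinations of variables, which is why a range of methods needs to be compared on the same perturbed data.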

More information

Published date: 13 October 2021

Identifiers

Local EPrints ID: 451775
URI: http://eprints.soton.ac.uk/id/eprint/451775
PURE UUID: 68fcc5e3-d3a2-4868-a42a-ae35e4901d98
ORCID for Zoheir Sabeur: orcid.org/0000-0003-4325-4871
ORCID for Gianluca Correndo: orcid.org/0000-0003-3335-5759
ORCID for Paul A. Smith: orcid.org/0000-0001-5337-2746

Catalogue record

Date deposited: 27 Oct 2021 16:30
Last modified: 17 Mar 2024 03:36

Contributors

Author: Zoheir Sabeur
Author: Gianluca Correndo
Author: Galina V Veres
Author: Paul A. Smith
Author: James Dawber
