The University of Southampton
University of Southampton Institutional Repository

Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort


Phan, Hang Thi Thu, Borca, Florina, Cable, David, Batchelor, James, Davies, Justin and Ennis, Sarah (2020) Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort. Scientific Reports, 10 (1), [10164]. (doi:10.1038/s41598-020-66925-7).

Record type: Article

Abstract

'Big data' in healthcare encompass measurements collated from multiple sources with varying degrees of data quality. These data require quality control assessment to optimise their quality for clinical management and for robust large-scale data analysis in healthcare research. Height and weight are among the most abundantly recorded health statistics. The shift to electronic recording of anthropometric measurements in electronic healthcare records has rapidly inflated the number of measurements. WHO guidelines inform removal of population-based extreme outliers, but an absence of tools limits cleaning of longitudinal anthropometric measurements. We developed and optimised a protocol for cleaning paediatric height and weight data that incorporates outlier detection based on robust linear regression, using a manually curated set of 6,279 patients' longitudinal measurements. The protocol was then applied to a cohort of 200,000 patient records collected from 60,000 paediatric patients attending a regional teaching hospital in South England. WHO guidelines detected biologically implausible data in <1% of records. The protocol detected additional error rates of 3% for height and 0.2% for weight. Inflated error rates for height measurements were largely due to small but physiologically implausible decreases in height. The lowest error rates were observed when data were measured and digitally recorded by staff routinely required to do so. The protocol successfully automates the parsing of implausible and poor-quality height and weight data from a voluminous longitudinal dataset and standardises the quality assessment of data for clinical and research applications.

More information

Accepted/In Press date: 13 May 2020
Published date: 1 December 2020
Funding Information: This work was supported by the NIHR BRC Infrastructure award, grant number IS_BRC-1215_20004.
Publisher Copyright: © 2020, The Author(s).

Identifiers

Local EPrints ID: 442017
URI: http://eprints.soton.ac.uk/id/eprint/442017
ISSN: 2045-2322
PURE UUID: 4794bda6-b08b-482c-93e4-4897e7f48ac8
ORCID for James Batchelor: orcid.org/0000-0002-5307-552X
ORCID for Sarah Ennis: orcid.org/0000-0003-2648-0869

Catalogue record

Date deposited: 03 Jul 2020 16:39
Last modified: 17 Mar 2024 05:34

Contributors

Author: Hang Thi Thu Phan
Author: Florina Borca
Author: David Cable
Author: James Batchelor
Author: Justin Davies
Author: Sarah Ennis
