Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort

'Big data' in healthcare encompass measurements collated from multiple sources with varying degrees of data quality. These data require quality control assessment to optimise quality for clinical management and for robust large-scale data analysis in healthcare research. Height and weight data are among the most abundantly recorded health statistics. The shift to recording anthropometric measurements in electronic healthcare records has rapidly inflated the number of measurements. WHO guidelines inform removal of population-based extreme outliers, but an absence of tools limits cleaning of longitudinal anthropometric measurements. We developed and optimised a protocol for cleaning paediatric height and weight data that incorporates outlier detection by robust linear regression, using a manually curated set of 6,279 patients' longitudinal measurements. The protocol was then applied to a cohort of 200,000 patient records collected from 60,000 paediatric patients attending a regional teaching hospital in South England. WHO guidelines detected biologically implausible data in <1% of records. The protocol detected additional error rates of 3% and 0.2% for height and weight respectively. Inflated error rates for height measurements were largely due to small but physiologically implausible decreases in height. The lowest error rates were observed when data were measured and digitally recorded by staff routinely required to do so. The protocol successfully automates the parsing of implausible and poor-quality height and weight data from a voluminous longitudinal dataset and standardises the quality assessment of data for clinical and research applications.
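As a rough illustration of the idea of flagging outliers in one patient's longitudinal measurements with a robust linear fit, the following minimal Python sketch may help. It is not the authors' protocol: the abstract does not state which robust estimator or thresholds were used, so a Theil-Sen line (median of pairwise slopes) is swapped in here as a simple robust regression, and the function names, the `threshold` of 3.5 and the `min_scale` floor are all hypothetical choices made for this example.

```python
from statistics import median


def theil_sen(ages, values):
    """Fit a robust line: slope is the median of all pairwise slopes."""
    slopes = [
        (values[j] - values[i]) / (ages[j] - ages[i])
        for i in range(len(ages))
        for j in range(i + 1, len(ages))
        if ages[j] != ages[i]
    ]
    slope = median(slopes)
    intercept = median(v - slope * a for a, v in zip(ages, values))
    return slope, intercept


def flag_outliers(ages, values, threshold=3.5, min_scale=0.5):
    """Return one boolean per measurement; True marks a suspected error.

    threshold and min_scale are illustrative assumptions, not values
    taken from the paper.
    """
    if len(set(ages)) < 3:
        # Too few distinct time points to judge a trend.
        return [False] * len(values)
    slope, intercept = theil_sen(ages, values)
    residuals = [v - (slope * a + intercept) for a, v in zip(ages, values)]
    centre = median(residuals)
    # Median absolute deviation, scaled to be comparable to a standard
    # deviation; the floor avoids division by zero on a perfect fit.
    mad = median(abs(r - centre) for r in residuals)
    scale = max(1.4826 * mad, min_scale)
    return [abs(r - centre) / scale > threshold for r in residuals]


# Hypothetical heights (cm) by age (years); 9.9 mimics a misplaced decimal.
ages = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
heights = [86.0, 90.5, 95.0, 9.9, 104.0, 108.5]
print(flag_outliers(ages, heights))
```

Because the slope and the residual scale are both medians, a single gross error such as the misplaced decimal above does not drag the fitted line towards itself, which is the property that makes robust regression attractive for this kind of cleaning. A straight line is only a reasonable approximation over short age ranges, so a real pipeline would need a growth-appropriate model or transformed measurements.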

10.1038/s41598-020-66925-7

2045-2322

Phan, Hang Thi Thu

2811b94c-62b7-459d-9cc1-c88057008e3b

Borca, Florina

31fc3965-6bcf-4fd6-85bc-8b0f99f62473

Cable, David

91192b85-6469-4495-9c34-637556a49cfc

Batchelor, James

e53c36c7-aa7f-4fae-8113-30bfbb9b36ee

Davies, Justin

9f18fcad-f488-4c72-ac23-c154995443a9

Ennis, Sarah

7b57f188-9d91-4beb-b217-09856146f1e9

1 December 2020

Phan, Hang Thi Thu, Borca, Florina, Cable, David, Batchelor, James, Davies, Justin and Ennis, Sarah (2020) Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort. Scientific Reports, 10 (1), [10164]. (doi:10.1038/s41598-020-66925-7).

Record type: Article

Text

20200408_AnthropometricDataCleaningPaper_Draft - Accepted Manuscript

Available under License University of Southampton Accepted Manuscript Licence.

Download (85kB)

Text

20200401_AnthropometricDataCleaningPaper_Supplementary - Accepted Manuscript

Available under License University of Southampton Accepted Manuscript Licence.

Download (4MB)

Image

AnthropometricManuscript_Figure1 - Accepted Manuscript

Download (589kB)

Image

AnthropometricManuscript_Figure2 - Accepted Manuscript

Restricted to Repository staff only

Image

AnthropometricManuscript_Figure3 - Accepted Manuscript

Restricted to Repository staff only

Image

AnthropometricManuscript_Figure4 - Accepted Manuscript

Restricted to Repository staff only

Image

AnthropometricManuscript_Figure5 - Accepted Manuscript

Restricted to Repository staff only

More information

Accepted/In Press date: 13 May 2020

Published date: 1 December 2020

Identifiers

Local EPrints ID: 442017

URI: http://eprints.soton.ac.uk/id/eprint/442017

DOI: 10.1038/s41598-020-66925-7

ISSN: 2045-2322

PURE UUID: 4794bda6-b08b-482c-93e4-4897e7f48ac8

ORCID for James Batchelor: orcid.org/0000-0002-5307-552X

ORCID for Sarah Ennis: orcid.org/0000-0003-2648-0869

Catalogue record

Date deposited: 03 Jul 2020 16:39

Last modified: 11 Jul 2025 04:03

Contributors

Author: Hang Thi Thu Phan

Author: Florina Borca

Author: David Cable

Author: James Batchelor

Author: Justin Davies

Author: Sarah Ennis
