Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort
Phan, Hang Thi Thu
2811b94c-62b7-459d-9cc1-c88057008e3b
Borca, Florina
31fc3965-6bcf-4fd6-85bc-8b0f99f62473
Cable, David
91192b85-6469-4495-9c34-637556a49cfc
Batchelor, James
e53c36c7-aa7f-4fae-8113-30bfbb9b36ee
Davies, Justin
9f18fcad-f488-4c72-ac23-c154995443a9
Ennis, Sarah
7b57f188-9d91-4beb-b217-09856146f1e9
1 December 2020
Phan, Hang Thi Thu, Borca, Florina, Cable, David, Batchelor, James, Davies, Justin and Ennis, Sarah (2020) Automated data cleaning of paediatric anthropometric data from longitudinal electronic health records: protocol and application to a large patient cohort. Scientific Reports, 10 (1), [10164]. (doi:10.1038/s41598-020-66925-7).
Abstract
'Big data' in healthcare encompass measurements collated from multiple sources with varying degrees of data quality. These data require quality control assessment to optimise their quality for clinical management and for robust large-scale analysis in healthcare research. Height and weight are among the most abundantly recorded health statistics. The shift to recording anthropometric measurements in electronic healthcare records has rapidly inflated the number of measurements. WHO guidelines inform the removal of population-based extreme outliers, but an absence of tools limits the cleaning of longitudinal anthropometric measurements. We developed and optimised a protocol for cleaning paediatric height and weight data that incorporates outlier detection based on robust linear regression, using a manually curated set of 6,279 patients' longitudinal measurements. The protocol was then applied to a cohort of 200,000 patient records collected from 60,000 paediatric patients attending a regional teaching hospital in South England. WHO guidelines detected biologically implausible data in <1% of records. The protocol detected additional error rates of 3% for height and 0.2% for weight. The inflated error rate for height measurements was largely due to small but physiologically implausible decreases in height. The lowest error rates were observed when data were measured and digitally recorded by staff routinely required to do so. The protocol successfully automates the parsing of implausible and poor-quality height and weight data from a voluminous longitudinal dataset and standardises the quality assessment of data for clinical and research applications.
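As an illustration of the kind of longitudinal outlier detection the abstract describes, the sketch below fits a robust line to one patient's height measurements and flags points that sit far from it. It is not the authors' protocol: the abstract only says "robust linear regression methodology", so the Theil-Sen estimator (median of pairwise slopes) is swapped in here as one simple robust fit. The function names, the threshold k, the floor min_scale and the example measurements are all hypothetical, and a straight line is only a rough approximation of growth over a short age range.

```python
from itertools import combinations
from statistics import median


def theil_sen_fit(ages, values):
    """Robust line fit: median of pairwise slopes, then median intercept."""
    slopes = [
        (v2 - v1) / (a2 - a1)
        for (a1, v1), (a2, v2) in combinations(zip(ages, values), 2)
        if a2 != a1  # skip repeated ages to avoid division by zero
    ]
    slope = median(slopes)
    intercept = median(v - slope * a for a, v in zip(ages, values))
    return slope, intercept


def flag_outliers(ages, values, k=5.0, min_scale=0.5):
    """Return one boolean per measurement; True marks a suspected error.

    k and min_scale are illustrative assumptions, not values from the paper.
    """
    slope, intercept = theil_sen_fit(ages, values)
    residuals = [v - (intercept + slope * a) for a, v in zip(ages, values)]
    centre = median(residuals)
    # Scaled median absolute deviation of the residuals; the floor keeps
    # the scale from collapsing when the trajectory is almost exactly linear.
    mad = 1.4826 * median(abs(r - centre) for r in residuals)
    scale = max(mad, min_scale)
    return [abs(r - centre) > k * scale for r in residuals]


# Made-up example: ages in years, heights in cm. The fourth height (59.0)
# is far below the trend set by the other five measurements.
ages = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
heights = [86.0, 90.5, 95.0, 59.0, 102.5, 106.0]
print(flag_outliers(ages, heights))
# [False, False, False, True, False, False]
```

Because the slope and the scale are both medians, a single gross error barely moves the fitted line, which is what lets the erroneous point stand out. An ordinary least-squares fit would be pulled towards the error and could mask it.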
Text: 20200408_AnthropometricDataCleaningPaper_Draft - Accepted Manuscript
Text: 20200401_AnthropometricDataCleaningPaper_Supplementary - Accepted Manuscript
Image: AnthropometricManuscript_Figure1 - Accepted Manuscript
Image: AnthropometricManuscript_Figure2 - Accepted Manuscript (restricted to repository staff only)
Image: AnthropometricManuscript_Figure3 - Accepted Manuscript (restricted to repository staff only)
Image: AnthropometricManuscript_Figure4 - Accepted Manuscript (restricted to repository staff only)
Image: AnthropometricManuscript_Figure5 - Accepted Manuscript (restricted to repository staff only)
More information
Accepted/In Press date: 13 May 2020
Published date: 1 December 2020
Additional Information:
Funding Information:
This work was supported by the NIHR BRC Infrastructure award, grant number IS_BRC-1215_20004.
Publisher Copyright:
© 2020, The Author(s).
Identifiers
Local EPrints ID: 442017
URI: http://eprints.soton.ac.uk/id/eprint/442017
ISSN: 2045-2322
PURE UUID: 4794bda6-b08b-482c-93e4-4897e7f48ac8
Catalogue record
Date deposited: 03 Jul 2020 16:39
Last modified: 17 Mar 2024 05:34
Contributors
Author: Hang Thi Thu Phan
Author: Florina Borca
Author: David Cable