Automatic multilabel detection of ICD10 codes in Dutch cardiology discharge letters using neural networks
Standard reference terminology for diagnoses and risk factors is crucial for billing, epidemiological studies, and international and intranational comparisons of diseases. The International Classification of Diseases (ICD) is a standardized and widely used method, but manual classification is enormously time-consuming. Natural language processing together with machine learning allows automated structuring of diagnoses using ICD-10 codes, but the limited performance of machine learning models, the need for very large datasets, and the poor reliability of the terminal parts of these codes have restricted clinical usability. We aimed to create a high-performing pipeline for automated classification of reliable ICD-10 codes in free medical text in cardiology. We focussed on frequently used and well-defined three- and four-character ICD-10 codes that still have enough granularity to be clinically relevant, such as atrial fibrillation (I48), acute myocardial infarction (I21), or dilated cardiomyopathy (I42.0). Our pipeline uses a deep neural network known as a Bidirectional Gated Recurrent Unit Neural Network and was trained and tested with 5548 discharge letters and validated in 5089 discharge and procedural letters. Because discharge letters in clinical practice may be labeled with more than one code, we assessed single- and multilabel performance for main diagnoses and cardiovascular risk factors. We investigated using both the entire body of text and only the summary paragraph, supplemented by age and sex. Given the privacy-sensitive information included in discharge letters, we added a de-identification step. Performance was high, with F1 scores of 0.76–0.99 for three-character and 0.87–0.98 for four-character ICD-10 codes, and was best when using complete discharge letters. Adding the variables age and sex did not affect results. For model interpretability, word coefficients were provided and a qualitative assessment of classification was performed manually.
Because of its high performance, this pipeline can help decrease the administrative burden of classifying discharge diagnoses and may serve as a scaffold for reimbursement and research applications.
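The abstract reports F1 scores per ICD-10 code for letters that may carry more than one code. As a rough illustration of how such a per-code score can be computed for multilabel output, here is a minimal Python sketch. The function name `per_label_f1` and the toy letters are assumptions made up for this example; they are not the authors' code or data.

```python
from typing import Dict, List, Set


def per_label_f1(gold: List[Set[str]], pred: List[Set[str]]) -> Dict[str, float]:
    """Return the F1 score of each label over paired gold/predicted label sets."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    # Every code that occurs in either the reference or the predictions.
    labels = set().union(*gold, *pred)
    scores = {}
    for label in sorted(labels):
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the label never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Toy data, invented for illustration only (not taken from the study).
gold = [{"I48"}, {"I21", "I48"}, {"I42.0"}, {"I21"}]
pred = [{"I48"}, {"I21"}, {"I42.0"}, {"I21", "I48"}]
scores = per_label_f1(gold, pred)
```

Each letter is represented as a set of codes, so a letter with two diagnoses contributes to the counts of both labels independently.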
Sammani, Arjan
Bagheri, Ayoub
Van Der Heijden, Peter
te Riele, Anneline S. J. M.
Baas, Annette F.
Oosters, C. A. J.
Oberski, Daniel
Asselbergs, Folkert W.
December 2021
Sammani, Arjan, Bagheri, Ayoub, Van Der Heijden, Peter, te Riele, Anneline S. J. M., Baas, Annette F., Oosters, C. A. J., Oberski, Daniel and Asselbergs, Folkert W. (2021) Automatic multilabel detection of ICD10 codes in Dutch cardiology discharge letters using neural networks. npj Digital Medicine, 4 (1), [37]. (doi:10.1038/s41746-021-00404-9).
More information
Accepted/In Press date: 26 January 2021
e-pub ahead of print date: 26 February 2021
Published date: December 2021
Additional Information:
Funding Information:
The authors thank Leslie Beks, Danielle Klokman and Annemiek Tuntelder for their efforts as correctors and medical coders, without whom this ICD-10 dataset would not have existed. Arjan Sammani is supported by the Alexandre Suerman Stipendium and CVON 2015-12 eDETECT YTP. Anneline te Riele is supported by the Dutch Heart Foundation (2015T058), the UMC Utrecht Fellowship Clinical Research Talent and CVON 2015-12 eDETECT. Annette Baas is supported by the Netherlands Heart Foundation (Dekker 2015T041). Folkert Asselbergs is supported by the UCL Hospitals NIHR Biomedical Research Centre. This study was funded by the Dutch Heart Foundation (CVON-AI: 2018B017) and by the focus area Applied Data Science at Utrecht University, The Netherlands.
Publisher Copyright:
© 2021, The Author(s).
Identifiers
Local EPrints ID: 447310
URI: http://eprints.soton.ac.uk/id/eprint/447310
ISSN: 2398-6352
PURE UUID: 468cfad3-cc73-4ebd-9b21-10b634f1cbc4
Catalogue record
Date deposited: 09 Mar 2021 17:31
Last modified: 17 Mar 2024 03:31
Contributors
Author: Arjan Sammani
Author: Ayoub Bagheri
Author: Anneline S. J. M. te Riele
Author: Annette F. Baas
Author: C. A. J. Oosters
Author: Daniel Oberski
Author: Folkert W. Asselbergs