University of Southampton Institutional Repository

Identification of cohorts with inflammatory bowel disease amidst fragmented clinical databases via machine learning

Purpose: inflammatory bowel disease (IBD) cohort identification relies primarily on Read/billing codes, which may miss some patients. However, a complete picture usually cannot be obtained because of database fragmentation and missingness. This study used novel cohort retrieval methods to identify the total IBD cohort at a large university teaching hospital with a specialist intestinal failure unit.

Methods: between 2007 and 2023, 11 clinical databases (ICD-10 codes, OPCS-4 codes, clinician-entry IBD registry, IBD patient portal, prescriptions, biochemistry, flare line calls, clinic appointments, endoscopy, histopathology, and clinic letters) were identified as potentially able to help identify local patients with IBD. The 11 databases were statistically compared, and a penalized logistic regression (LR) classifier was robustly trained and validated.
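
As an illustration of the kind of classifier described in the Methods, the following is a minimal Python sketch of a penalized logistic regression trained on binary "present in database" indicators. The abstract does not state the penalty type, the features, or the software used, so everything here is an assumption: an L2 (ridge) penalty, plain batch gradient descent, three invented feature columns, and hypothetical toy data. It is not the authors' model.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(weights: list[float], bias: float, row: list[int]) -> float:
    """Predicted probability of IBD for one patient row."""
    return sigmoid(bias + sum(w * x for w, x in zip(weights, row)))


def fit_l2_logistic(X, y, l2=0.01, lr=0.5, epochs=2000):
    """Batch gradient descent on mean log-loss + (l2 / 2) * ||w||^2.

    The L2 penalty is an assumption; the abstract only says "penalized".
    """
    n, d = len(X), len(X[0])
    weights, bias = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            err = predict_proba(weights, bias, row) - label
            grad_b += err
            for j, x in enumerate(row):
                grad_w[j] += err * x
        # The penalty is applied to the weights only, not the intercept.
        weights = [w - lr * (g / n + l2 * w) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias


# Hypothetical toy data. Columns (invented for illustration):
# [ICD-10 code present, IBD prescription present, IBD clinic appointment present]
X = [
    [1, 1, 1], [1, 1, 0], [0, 1, 1], [1, 0, 1],
    [0, 0, 0], [0, 0, 1], [0, 1, 0], [0, 0, 0],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

weights, bias = fit_l2_logistic(X, y)
probs = [predict_proba(weights, bias, row) for row in X]

THRESHOLD = 0.5  # illustrative only; the study reports 0.496 for its own model
flags = [p >= THRESHOLD for p in probs]
print(flags)
```

In this sketch the penalty shrinks the weights towards zero, which limits overfitting when some databases flag very few patients. In practice a maintained library implementation with cross-validated penalty strength would be the more usual choice.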

Results: the gold-standard validation cohort comprised 2,800 patients: 2,092 (75%) with IBD and 708 (25%) without. All the databases contained unique patients that were not covered by the Casemix ICD-10 database. The penalized LR model (validation AUROC: 0.85) confidently identified 8,159 patients with IBD (threshold: 0.496). By combining the likely true-positive predictions from the LR model with likely true-positive IBD clinic letters, a final estimate of 13,048 patients with IBD was obtained. ICD-10 codes combined with medication data identified only 8,048 patients, suggesting that present recapture methods missed 38.3% of the local cohort.
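
The 38.3% figure follows directly from the two cohort sizes reported in the Results. The short Python sketch below reproduces that arithmetic and adds a toy illustration of counting patients who appear in one database but not in the ICD-10 database. The function names and the patient identifiers are hypothetical and are not taken from the study.

```python
# Cohort sizes reported in the abstract.
FINAL_ESTIMATE = 13_048   # LR predictions combined with clinic letters
CODES_PLUS_MEDS = 8_048   # ICD-10 codes combined with medication data


def missed_fraction(captured: int, total: int) -> float:
    """Fraction of the estimated cohort not captured by a retrieval method."""
    if total <= 0:
        raise ValueError("total must be positive")
    return (total - captured) / total


def unique_to(database: set[str], reference: set[str]) -> set[str]:
    """Patients present in `database` but absent from `reference`."""
    return database - reference


missed = missed_fraction(CODES_PLUS_MEDS, FINAL_ESTIMATE)
print(f"Missed by codes + medication: {missed:.1%}")  # 38.3%

# Hypothetical toy data: these identifiers are invented for illustration.
icd10 = {"p1", "p2", "p3"}
endoscopy = {"p2", "p3", "p4"}
print(sorted(unique_to(endoscopy, icd10)))  # ['p4']
```

That is, (13,048 − 8,048) / 13,048 = 5,000 / 13,048 ≈ 0.383.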

Conclusion: diagnostic billing codes and medication data alone cannot accurately identify complete cohorts of individuals with IBD in secondary care. A multimodal cross-database model can partially compensate for this deficit. However, to improve this situation in the future, more robust natural language processing (NLP)-based identification mechanisms will be required.

Stammers, Matthew, Sartain, Stephanie, Cummings, J.R. Fraser, Kipps, Christopher, Nouraei, Reza, Gwiggner, Markus, Metcalf, Cheryl and Batchelor, James (2025) Identification of cohorts with inflammatory bowel disease amidst fragmented clinical databases via machine learning. Digestive Diseases and Sciences. (doi:10.1007/s10620-025-09323-1).

Record type: Article

Text
s10620-025-09323-1 - Version of Record
Available under License Creative Commons Attribution.

More information

Accepted/In Press date: 5 August 2025
e-pub ahead of print date: 13 August 2025
Keywords: Cohort identification, Data fragmentation, Inflammatory bowel disease, Machine learning

Identifiers

Local EPrints ID: 505572
URI: http://eprints.soton.ac.uk/id/eprint/505572
ISSN: 0163-2116
PURE UUID: 59017686-9701-467b-8bfe-d655baad52a2
ORCID for Matthew Stammers: orcid.org/0000-0003-3850-3116
ORCID for Christopher Kipps: orcid.org/0000-0002-5205-9712
ORCID for James Batchelor: orcid.org/0000-0002-5307-552X

Catalogue record

Date deposited: 14 Oct 2025 16:38
Last modified: 15 Oct 2025 02:15

Contributors

Author: Matthew Stammers
Author: Stephanie Sartain
Author: J.R. Fraser Cummings
Author: Christopher Kipps
Author: Reza Nouraei
Author: Markus Gwiggner
Author: Cheryl Metcalf
Author: James Batchelor
