Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records

Objective: natural language processing (NLP) can identify cohorts of patients with inflammatory bowel disease (IBD) from free text. However, limited sharing of code, models, and data sets continues to hinder progress. The aim of this study was to evaluate multiple open-source NLP models for identifying IBD cohorts, reporting on document-to-patient-level classification, while exploring explainability, generalisability, fairness and cost.

Methods: 15 algorithms were assessed, covering all types of NLP across more than 50 years of development: rule-based methods (regular expressions, spaCy with negation); vector-based methods (bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), word-2-vector); and transformers, comprising two sentence-based sBERT models, three bidirectional encoder representations from transformers (BERT) models (distilBERT, BioclinicalBERT, RoBERTa), and five large language models (LLMs) (Mistral-Instruct-v0.3-7B, M42-Health/Llama-v3-8B, Deepseek-R1-Distill-Qwen-v2.5-32B, Qwen-v3-32B, and Deepseek-R1-Distill-Llama-v3-70B). Models were comparatively evaluated based on full confusion matrices, time/environmental costs, fairness, and explainability.
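To make the simplest of these approaches concrete, the following is a minimal sketch of a regular-expression classifier with a naive negation check, followed by document-to-patient aggregation. It is not the authors' implementation: the patterns, the 40-character look-back window, the function names, and the rule that a patient is positive if any one of their documents is positive are all assumptions made for illustration.

```python
import re
from collections import defaultdict

# Hypothetical term list; a real system would need a far richer vocabulary.
IBD_PATTERN = re.compile(
    r"\b(crohn'?s?|ulcerative colitis|inflammatory bowel disease|ibd)\b",
    re.IGNORECASE,
)
# Hypothetical negation cues, checked in the text just before each match.
NEGATION_CUE = re.compile(
    r"\b(no|not|without|denies|excluded|ruled out)\b", re.IGNORECASE
)


def document_mentions_ibd(text: str) -> bool:
    """Return True if the document has at least one non-negated IBD mention."""
    for match in IBD_PATTERN.finditer(text):
        # Look back up to 40 characters, staying within the current sentence.
        window = text[max(0, match.start() - 40):match.start()]
        window = window.split(".")[-1]
        if not NEGATION_CUE.search(window):
            return True
    return False


def classify_patients(documents):
    """Aggregate (patient_id, text) pairs to one flag per patient.

    Assumed rule: a patient is positive if any of their documents is positive.
    """
    flags = defaultdict(bool)
    for patient_id, text in documents:
        flags[patient_id] = flags[patient_id] or document_mentions_ibd(text)
    return dict(flags)


# Invented example documents, not drawn from the study data.
example_documents = [
    ("p1", "Known Crohn's disease, on azathioprine."),
    ("p1", "No evidence of active colitis."),
    ("p2", "No evidence of inflammatory bowel disease."),
    ("p3", "Colonoscopy normal."),
]
patient_flags = classify_patients(example_documents)
```

The any-positive aggregation rule favours sensitivity over specificity; a stricter threshold (for example, requiring several positive documents) would shift that trade-off.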

Results: a total of 9311 labelled documents were evaluated. The fine-tuned DistilBERT_IBD model achieved the best performance overall (micro F1: 93.54%), followed by sBERT-Base (micro F1: 93.05%); however, specificity was an issue for both (67.80% and 64.41%, respectively). LLMs performed well given that they had never seen the training data (micro F1: 86.47-92.20%), but were comparatively slow (18-300 hours) and expensive. Bias was a significant issue for all effective model types.
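A high micro F1 can coexist with weak specificity when positive documents dominate the data set. The sketch below derives both metrics from a binary confusion matrix, using the standard definitions rather than the authors' code. The counts are invented for illustration and are not the study's results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Binary confusion matrix counts."""
    tp: int
    fp: int
    tn: int
    fn: int


def specificity(c: Confusion) -> float:
    """Proportion of true negatives among all actual negatives."""
    return c.tn / (c.tn + c.fp)


def sensitivity(c: Confusion) -> float:
    """Proportion of true positives among all actual positives."""
    return c.tp / (c.tp + c.fn)


def micro_f1(c: Confusion) -> float:
    """Micro-averaged F1 over both classes.

    Counts are pooled across the positive and negative class before
    computing precision and recall. For single-label binary data this
    equals accuracy.
    """
    pooled_tp = c.tp + c.tn   # correct predictions for either class
    pooled_fp = c.fp + c.fn   # a miss for one class is a false alarm for the other
    pooled_fn = c.fn + c.fp
    precision = pooled_tp / (pooled_tp + pooled_fp)
    recall = pooled_tp / (pooled_tp + pooled_fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative counts only (assumed, imbalanced towards positives).
example = Confusion(tp=90, fp=10, tn=20, fn=5)
example_micro_f1 = micro_f1(example)        # 110 / 125
example_specificity = specificity(example)  # 20 / 30
```

With these assumed counts, most documents are classified correctly overall, yet a third of the true negatives are misclassified, which is the same pattern of strong micro F1 and modest specificity described above.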

Conclusion: NLP has undergone significant advancements over the last 50 years. LLMs appear likely to solve the problem of re-identifying patients with IBD from clinical free text sources in the future. Once cost, performance and bias issues are addressed, they and their successors are likely to become the primary method of data retrieval for clinical data warehousing.

Keywords: Algorithms, Data Mining/methods, Electronic Health Records/statistics & numerical data, Humans, Inflammatory Bowel Diseases/diagnosis, Natural Language Processing, IBD MODELS, ARTIFICIAL INTELLIGENCE, IBD

10.1136/bmjgast-2025-001977

2054-4774

Stammers, Matt

Gwiggner, Markus

Nouraei, Reza

Metcalf, Cheryl

Batchelor, James

10 October 2025

Stammers, Matt, Gwiggner, Markus, Nouraei, Reza, Metcalf, Cheryl and Batchelor, James (2025) Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records. BMJ Open Gastroenterology, 12 (1), [e001977]. (doi:10.1136/bmjgast-2025-001977).

Record type: Article

e001977.full - Version of Record

Available under License Creative Commons Attribution Non-commercial.

More information

Accepted/In Press date: 26 August 2025

Published date: 10 October 2025