University of Southampton Institutional Repository

Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records

Stammers, Matt, Gwiggner, Markus, Nouraei, Reza, Metcalf, Cheryl and Batchelor, James (2025) Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records. BMJ Open Gastroenterology, 12 (1), [e001977]. (doi:10.1136/bmjgast-2025-001977).

Record type: Article

Abstract

Objective: natural language processing (NLP) can identify cohorts of patients with inflammatory bowel disease (IBD) from free text. However, limited sharing of code, models, and data sets continues to hinder progress. The aim of this study was to evaluate multiple open-source NLP models for identifying IBD cohorts, reporting on document-to-patient-level classification, while exploring explainability, generalisability, fairness and cost.

Methods: 15 algorithms were assessed, covering all types of NLP and spanning over 50 years of NLP development: rule-based methods (regular expressions, spaCy with negation); vector-based methods (bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), word2vec); and transformers, comprising two sentence-based sBERT models, three bidirectional encoder representations from transformers (BERT) models (DistilBERT, BioclinicalBERT, RoBERTa), and five large language models (LLMs) (Mistral-Instruct-v0.3-7B, M42-Health/Llama-v3-8B, Deepseek-R1-Distill-Qwen-v2.5-32B, Qwen-v3-32B, and Deepseek-R1-Distill-Llama-v3-70B). Models were comparatively evaluated based on full confusion matrices, time/environmental costs, fairness, and explainability.

Results: a total of 9311 labelled documents were evaluated. The fine-tuned DistilBERT_IBD model achieved the best performance overall (micro F1: 93.54%), followed by sBERT-Base (micro F1: 93.05%); however, specificity was an issue for both (67.80% and 64.41%, respectively). LLMs performed well given that they had never seen the training data (micro F1: 86.47-92.20%), but were comparatively slow (18-300 hours) and expensive. Bias was a significant issue for all effective model types.

Conclusion: NLP has undergone significant advancements over the last 50 years. LLMs appear likely to solve the problem of re-identifying patients with IBD from clinical free text sources in the future. Once cost, performance and bias issues are addressed, they and their successors are likely to become the primary method of data retrieval for clinical data warehousing.
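
The abstract reports high micro F1 alongside much lower specificity. The following minimal sketch shows how both figures can be derived from one binary confusion matrix, and why they can diverge when positive documents dominate. It is an illustration only and not the authors' evaluation code: the class name ConfusionMatrix, the function names, and the counts are all hypothetical, and the counts are invented rather than taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix counts; the positive class is 'IBD'."""
    tp: int  # IBD documents labelled IBD
    fp: int  # non-IBD documents labelled IBD
    fn: int  # IBD documents labelled non-IBD
    tn: int  # non-IBD documents labelled non-IBD


def sensitivity(cm: ConfusionMatrix) -> float:
    """Share of true IBD documents that were found (recall)."""
    return cm.tp / (cm.tp + cm.fn)


def specificity(cm: ConfusionMatrix) -> float:
    """Share of true non-IBD documents that were correctly rejected."""
    return cm.tn / (cm.tn + cm.fp)


def micro_f1(cm: ConfusionMatrix) -> float:
    """Micro-averaged F1 over both classes.

    Counts are pooled across the IBD and non-IBD classes before
    precision and recall are computed. A false positive for one class
    is a false negative for the other, so for single-label binary
    classification this equals overall accuracy.
    """
    pooled_tp = cm.tp + cm.tn
    pooled_fp = cm.fp + cm.fn
    pooled_fn = cm.fn + cm.fp
    precision = pooled_tp / (pooled_tp + pooled_fp)
    recall = pooled_tp / (pooled_tp + pooled_fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical, imbalanced counts (NOT from the study): 920 IBD
# documents and only 80 non-IBD documents.
example = ConfusionMatrix(tp=900, fp=40, fn=20, tn=40)

if __name__ == "__main__":
    print(f"sensitivity: {sensitivity(example):.2%}")
    print(f"specificity: {specificity(example):.2%}")
    print(f"micro F1:    {micro_f1(example):.2%}")
```

With these invented counts, micro F1 is 94% while specificity is only 50%. Because micro averaging is dominated by the majority class, reporting the full confusion matrix, as the study does, exposes weaknesses that a single pooled score can hide. The sketch assumes every denominator is non-zero.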

Text
e001977.full - Version of Record

More information

Accepted/In Press date: 26 August 2025
Published date: 10 October 2025
Keywords: Algorithms, Data Mining/methods, Electronic Health Records/statistics & numerical data, Humans, Inflammatory Bowel Diseases/diagnosis, Natural Language Processing, IBD MODELS, ARTIFICIAL INTELLIGENCE, IBD

Identifiers

Local EPrints ID: 506802
URI: http://eprints.soton.ac.uk/id/eprint/506802
ISSN: 2054-4774
PURE UUID: cd603e2d-090d-4af8-8ec7-ed64486c4cc8
ORCID for Matt Stammers: orcid.org/0000-0003-3850-3116
ORCID for Cheryl Metcalf: orcid.org/0000-0002-7404-6066
ORCID for James Batchelor: orcid.org/0000-0002-5307-552X

Catalogue record

Date deposited: 18 Nov 2025 17:58
Last modified: 20 Nov 2025 03:10

Contributors

Author: Matt Stammers
Author: Markus Gwiggner
Author: Reza Nouraei
Author: Cheryl Metcalf
Author: James Batchelor
