Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records
Objective: natural language processing (NLP) can identify cohorts of patients with inflammatory bowel disease (IBD) from free text. However, limited sharing of code, models, and data sets continues to hinder progress. The aim of this study was to evaluate multiple open-source NLP models for identifying IBD cohorts, reporting on document-to-patient-level classification, while exploring explainability, generalisability, fairness and cost.
Methods: 15 algorithms were assessed, covering all types of NLP and spanning over 50 years of NLP development: rule-based methods (regular expressions, spaCy with negation); vector-based methods (bag-of-words (BoW), term frequency-inverse document frequency (TF-IDF), word-2-vector); and transformers, comprising two sentence-based sBERT models, three bidirectional encoder representations from transformers (BERT) models (distilBERT, BioclinicalBERT, RoBERTa), and five large language models (LLMs): Mistral-Instruct-v0.3-7B, M42-Health/Llama-v3-8B, Deepseek-R1-Distill-Qwen-v2.5-32B, Qwen-v3-32B, and Deepseek-R1-Distill-Llama-v3-70B. Models were compared on full confusion matrices, time/environmental costs, fairness, and explainability.
Results: a total of 9311 labelled documents were evaluated. The fine-tuned DistilBERT_IBD model achieved the best overall performance (micro F1: 93.54%), followed by sBERT-Base (micro F1: 93.05%); however, specificity was an issue for both (67.80% and 64.41%, respectively). LLMs performed well given that they had never seen the training data (micro F1: 86.47-92.20%), but were comparatively slow (18-300 hours) and expensive. Bias was a significant issue for all effective model types.
Conclusion: NLP has undergone significant advancements over the last 50 years. LLMs appear likely to solve the problem of re-identifying patients with IBD from clinical free text sources in the future. Once cost, performance and bias issues are addressed, they and their successors are likely to become the primary method of data retrieval for clinical data warehousing.
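The abstract reports a high micro F1 alongside much lower specificity. The sketch below illustrates how both can follow from the same confusion matrix when positive (IBD) documents heavily outnumber negative ones. The counts are hypothetical and chosen only for illustration, not taken from the study, and the `ConfusionMatrix` class is an assumed helper rather than the authors' code. It also assumes micro-averaging pools both classes, in which case micro F1 for single-label classification equals accuracy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Binary confusion matrix counts, with IBD as the positive class."""
    tp: int  # IBD documents predicted as IBD
    fp: int  # non-IBD documents predicted as IBD
    tn: int  # non-IBD documents predicted as non-IBD
    fn: int  # IBD documents predicted as non-IBD

    def sensitivity(self) -> float:
        # Share of true IBD documents that were detected.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of true non-IBD documents that were correctly rejected.
        return self.tn / (self.tn + self.fp)

    def micro_f1(self) -> float:
        # Micro-averaging pools TP, FP and FN over both classes.
        pooled_tp = self.tp + self.tn  # correct predictions for either class
        pooled_fp = self.fp + self.fn  # every error is an FP for one class...
        pooled_fn = self.fn + self.fp  # ...and an FN for the other class
        return 2 * pooled_tp / (2 * pooled_tp + pooled_fp + pooled_fn)


# Hypothetical, imbalanced counts: 880 IBD documents, 120 non-IBD documents.
cm = ConfusionMatrix(tp=850, fp=40, tn=80, fn=30)

print(f"sensitivity: {cm.sensitivity():.2%}")
print(f"specificity: {cm.specificity():.2%}")  # 66.67%
print(f"micro F1:    {cm.micro_f1():.2%}")     # 93.00%
```

With these illustrative counts, one third of the non-IBD documents are misclassified, yet micro F1 stays at 93% because the negative class is small. This is one reason a full confusion matrix is more informative than a single pooled score.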
Keywords: Algorithms, Data Mining/methods, Electronic Health Records/statistics & numerical data, Humans, Inflammatory Bowel Diseases/diagnosis, Natural Language Processing, IBD MODELS, ARTIFICIAL INTELLIGENCE, IBD
Stammers, Matt
a4ad3bd5-7323-4a6d-9c00-2c34f8ae5bd3
Gwiggner, Markus
af72b597-1ead-4155-a25c-0835f7e560c2
Nouraei, Reza
f09047ee-ed51-495d-a257-11837e74c2b3
Metcalf, Cheryl
09a47264-8bd5-43bd-a93e-177992c22c72
Batchelor, James
e53c36c7-aa7f-4fae-8113-30bfbb9b36ee
10 October 2025
Stammers, Matt, Gwiggner, Markus, Nouraei, Reza, Metcalf, Cheryl and Batchelor, James (2025) Robust comparative evaluation of 15 natural language processing algorithms to positively identify patients with inflammatory bowel disease from secondary care records. BMJ Open Gastroenterology, 12 (1), [e001977]. (doi:10.1136/bmjgast-2025-001977).
Text: e001977.full (Version of Record)
More information
Accepted/In Press date: 26 August 2025
Published date: 10 October 2025
Identifiers
Local EPrints ID: 506802
URI: http://eprints.soton.ac.uk/id/eprint/506802
ISSN: 2054-4774
PURE UUID: cd603e2d-090d-4af8-8ec7-ed64486c4cc8
Catalogue record
Date deposited: 18 Nov 2025 17:58
Last modified: 20 Nov 2025 03:10
Contributors
Author:
Matt Stammers
Author:
Markus Gwiggner
Author:
Reza Nouraei