University of Southampton Institutional Repository

From rule-based to DeepSeek R1 – a robust comparative evaluation of fifty years of natural language processing (NLP) models to identify inflammatory bowel disease cohorts


Stammers, Matthew, Gwiggner, Markus, Nouraei, Reza, Metcalf, Cheryl and Batchelor, James (2025) From rule-based to DeepSeek R1 – a robust comparative evaluation of fifty years of natural language processing (NLP) models to identify inflammatory bowel disease cohorts. 28pp. (doi:10.1101/2025.07.06.25330961).

Record type: Monograph (Working Paper)

Abstract

Background: natural language processing (NLP) can identify cohorts of patients with inflammatory bowel disease (IBD) from free text. However, limited sharing of code, models, and datasets continues to hinder progress, and bias in foundation large language models (LLMs) remains a significant obstacle.

Objective: to evaluate 15 open-source NLP models for identifying IBD cohorts, reporting document-level and patient-level classification performance while exploring explainability, generalisability, bias, and cost.
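
One common way to read "document-to-patient-level classification" is that each document is classified first and the predictions are then rolled up per patient. The sketch below is purely illustrative and is not the authors' released pipeline: the column names and the "any positive document" rule are assumptions.

```python
# Minimal sketch (assumed, not the authors' pipeline): roll document-level
# predictions up to patient level, flagging a patient as an IBD case if any
# of their documents is classified positive.
import pandas as pd

# Hypothetical document-level predictions (1 = document classified as IBD).
doc_predictions = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2", "P3"],
    "doc_pred":   [1,    0,    0,    1],
})

patient_level = (
    doc_predictions.groupby("patient_id")["doc_pred"]
    .max()                       # "any positive document" rule
    .rename("ibd_case")
    .reset_index()
)
print(patient_level)
```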

Design: fifteen algorithms were assessed, covering fifty years of NLP development: regular expressions, spaCy, bag of words (BOW), term frequency-inverse document frequency (TF-IDF), Word2Vec, two sentence-based SBERT models, three BERT models (DistilBERT, RoBERTa, BioClinicalBERT), and five large language models (LLMs): Mistral-Instruct-0.3-7B, M42-Health/Llama3-8B, DeepSeek-R1-Distill-Qwen-32B, Qwen3-32B, and DeepSeek-R1-Distill-Llama-70B. Models were evaluated on F1 score, bias, environmental cost (grams of CO2 emitted), and explainability.
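
For readers less familiar with the older approaches listed above, the sketch below shows a minimal bag-of-words document classifier evaluated by F1 score. It is an illustration only: the toy documents, labels, and the logistic-regression classifier are assumptions and do not reproduce the study's configuration.

```python
# Illustrative BOW baseline (not the study's code): count word/bigram
# occurrences, fit a linear classifier, then score with F1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = document suggests IBD, 0 = it does not.
train_docs = [
    "colonoscopy shows active ulcerative colitis with continuous inflammation",
    "terminal ileum biopsies consistent with crohn's disease",
    "normal colonic mucosa, no evidence of inflammation",
    "referral for irritable bowel syndrome, colonoscopy unremarkable",
]
train_labels = [1, 1, 0, 0]
test_docs = ["features in keeping with crohn's colitis", "no inflammatory changes seen"]
test_labels = [1, 0]

bow_clf = make_pipeline(
    CountVectorizer(lowercase=True, ngram_range=(1, 2)),  # unigram + bigram counts
    LogisticRegression(max_iter=1000),
)
bow_clf.fit(train_docs, train_labels)
print("Document-level F1:", f1_score(test_labels, bow_clf.predict(test_docs)))
```

Swapping CountVectorizer for scikit-learn's TfidfVectorizer gives the TF-IDF variant of the same pipeline.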

Results: a total of 9311 labelled documents were evaluated. The fine-tuned DistilBERT model achieved the best performance (F1: 94.06%) and was more efficient (230.1 g CO2) than all other BERT and LLM models. BOW also performed strongly (F1: 93.38%) at very low cost (1.63 g CO2). LLMs performed less well (F1: 86.65% to 91.58%), incurred higher compute costs (938.5 to 33,884.4 g CO2), and showed more bias.
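
As a rough illustration of the kind of setup a fine-tuned DistilBERT classifier implies, the sketch below fine-tunes a generic distilbert-base-uncased checkpoint for binary document classification with Hugging Face Transformers. The checkpoint name, hyperparameters, and toy data are assumptions, not the study's released weights or training code.

```python
# Hedged sketch: fine-tune DistilBERT for binary (IBD / not IBD) document
# classification. Data and hyperparameters are placeholders.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

docs = ["active ulcerative colitis seen on biopsy", "normal mucosa, no inflammation"]
labels = torch.tensor([1, 0])  # hypothetical labels
enc = tokenizer(docs, padding=True, truncation=True, max_length=512, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for _ in range(3):  # a few passes over the toy batch
    optimizer.zero_grad()
    out = model(**enc, labels=labels)  # returns loss and logits
    out.loss.backward()
    optimizer.step()

model.eval()
with torch.no_grad():
    preds = model(**enc).logits.argmax(dim=-1)
print(preds.tolist())
```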

Conclusion: older NLP approaches, such as BOW, can outperform modern LLMs in clinical cohort detection when properly trained. While LLMs do not require task-specific pretraining, they are slower, more costly, and less accurate. All models and weights from this study are released as open source to benefit the research community.

Text
2025.07.06.25330961v1.full - Author's Original (435kB)

More information

e-pub ahead of print date: 7 July 2025

Identifiers

Local EPrints ID: 508845
URI: http://eprints.soton.ac.uk/id/eprint/508845
PURE UUID: 39915b51-a783-4210-a072-07589d2fded4
ORCID for Matthew Stammers: orcid.org/0000-0003-3850-3116
ORCID for James Batchelor: orcid.org/0000-0002-5307-552X

Catalogue record

Date deposited: 04 Feb 2026 17:56
Last modified: 05 Feb 2026 03:12


Contributors

Author: Matthew Stammers
Author: Markus Gwiggner
Author: Reza Nouraei
Author: Cheryl Metcalf
Author: James Batchelor

