The University of Southampton
University of Southampton Institutional Repository

Application of generative artificial intelligence to utilise unstructured clinical data for acceleration of inflammatory bowel disease research

Background: inflammatory bowel disease (IBD) research is a dynamic field. However, the growing volume of electronic health records (EHRs) and research data presents significant challenges. Traditional methods for structuring unstructured medical records are labour-intensive and lack scalability. Large language models (LLMs) may present a solution, yet their usefulness in data standardisation in the context of IBD remains unknown.

Objective: to evaluate the use of LLMs in structuring free-text histology and radiology reports from IBD patients, compare their performance to manual clinician curation, and assess the usefulness of fine-tuning and retrieval-augmented generation (RAG).

Design: we developed an IBD-specialised LLM-based framework utilising structured prompt engineering and fine-tuning. Reports were manually curated and processed using various LLMs. Performance was assessed, and RAG was used to enhance model responses with clinical guidelines from the European Crohn’s and Colitis Organisation (ECCO) and the European Society for Paediatric Gastroenterology, Hepatology and Nutrition (ESPGHAN).

Results: overall, Llama 3.3 achieved the highest F1 scores for histology and imaging reports (1 ± 0 and 0.85 ± 0.29, respectively) in extracting findings and anatomical regions, surpassing other models in structured data generation. Fine-tuning improved the performance of the smaller Llama 3.1 8B model for imaging reports (from 0.7 ± 0.46 to 0.82 ± 0.35), enabling better extraction with reduced computational requirements.

Conclusion: our findings demonstrate the feasibility of LLM-based automated structuring of IBD-related medical records. Unstructured data from free-text reports can be reliably converted to standardised ontologies with location, severity, and qualifiers. These advancements enable scalable, privacy-compliant AI-driven solutions for data standardisation.
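
As an illustration only, the sketch below shows one way an extracted finding with location, severity, and qualifier could be represented, and how an F1 score per report and a mean ± standard deviation across reports could be computed against clinician curation. The field names, the exact-match scoring rule, and the use of the population standard deviation are assumptions made for this example; the record does not describe the authors' actual schema or evaluation code.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Optional


@dataclass(frozen=True)
class Finding:
    """Hypothetical structured record for one extracted finding."""
    finding: str                      # e.g. "cryptitis"
    location: str                     # anatomical region, e.g. "sigmoid colon"
    severity: Optional[str] = None    # e.g. "mild"
    qualifier: Optional[str] = None   # e.g. "focal"


def f1_score(predicted: set, reference: set) -> float:
    """F1 for one report, counting a finding as correct only on exact match."""
    if not predicted and not reference:
        return 1.0  # nothing to extract and nothing extracted
    true_positives = len(predicted & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    return 2 * precision * recall / (precision + recall)


def summarise(scores: list) -> tuple:
    """Mean and population standard deviation of per-report F1 scores."""
    return mean(scores), pstdev(scores)


# Clinician-curated reference versus a model output that misses one finding.
reference = {
    Finding("cryptitis", "sigmoid colon", "mild"),
    Finding("granuloma", "terminal ileum"),
}
predicted = {Finding("cryptitis", "sigmoid colon", "mild")}

report_f1 = f1_score(predicted, reference)
average, spread = summarise([1.0, report_f1])
print(f"report F1 = {report_f1:.2f}; across reports: {average:.2f} ± {spread:.2f}")
```

Exact matching is the strictest possible rule; a real evaluation might instead score finding, location, and severity separately, which would change the reported values.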
medRxiv
Kadhim, Alex Z.
a70585d6-5470-48c4-a4c1-2c261c5183c4
Green, Zachary
b3269022-c0a6-42db-859d-d92c4cc5f4f0
Nazari, Iman
b2ec0c70-a591-47ca-9131-8c526fb999b2
Baker, Jonathan
eeac94ac-d265-4350-a882-b6cc088eb141
George, Michael
5bd91b32-01fd-4cf1-bc24-f3c4102865c3
Heinson, Ashley
74a31857-482c-4543-833e-53629a02e14a
Stammers, Matt
85e202da-1879-4f96-8e24-5059a1fa3f1e
Kipps, Christopher M.
14aea786-0bf1-4107-812d-6914e74f7a96
Beattie, R. Mark
55d81c7b-08c9-4f42-b6d3-245869badb71
Ashton, James J.
1c0bfa29-794c-4fd5-93e0-6769e6037d72
Ennis, Sarah
7b57f188-9d91-4beb-b217-09856146f1e9

Text
Kadhim_et_al_IBD_LLM_Pre-approval_v3 - Author's Original
Restricted to Repository staff only

More information

Submitted date: 10 March 2025

Identifiers

Local EPrints ID: 506677
URI: http://eprints.soton.ac.uk/id/eprint/506677
PURE UUID: 761e4dac-5663-4c2d-a784-6ac811897c24
ORCID for Alex Z. Kadhim: orcid.org/0000-0001-5600-0411
ORCID for Zachary Green: orcid.org/0000-0002-2907-5538
ORCID for Sarah Ennis: orcid.org/0000-0003-2648-0869

Catalogue record

Date deposited: 13 Nov 2025 17:51
Last modified: 27 Nov 2025 03:09

Contributors

Author: Alex Z. Kadhim
Author: Zachary Green
Author: Iman Nazari
Author: Jonathan Baker
Author: Michael George
Author: Ashley Heinson
Author: Matt Stammers
Author: Christopher M. Kipps
Author: R. Mark Beattie
Author: James J. Ashton
Author: Sarah Ennis
