The University of Southampton
University of Southampton Institutional Repository

Exploratory report on data synchronising methods to develop machine learning-based prediction models for multimorbidity

Preprints.Org
Delanerolle, Gayathri
ad2cf9bf-da2b-4f15-bc5d-ffbceb77786a
Benfield, David
dfd71ebe-c3ec-4130-96f2-6cc80178c3c5
Phiri, Peter
02de1b5c-df46-4231-8f81-a0e3e3e95ce7
Bouchareb, Yassine
32bbc0c2-7bc8-48e7-ad88-e80309f76295
Majumder, Kingshuk
a6746c59-d685-47d4-bcf1-f6e9d586905b
Cavalini, Heitor
ed8f6472-762e-4a94-bdf2-48b09ed995dd
Shi, Jian
568dc052-e223-425a-8ac7-489380b108f8
Kurmi, Om
18c5038f-d010-4e64-b2d4-8e430bdc98cf
Shetty, Ashish
0fe48b6e-14e6-463a-9833-c15ec8742a09
Hapanagama, Dharani K.
b44040f0-9a90-418b-92f8-3766535b072e
Zemkoho, Alain
30c79e30-9879-48bd-8d0b-e2fbbc01269e

Record type: UNSPECIFIED

Abstract

Endometriosis is a complex chronic condition characterised by chronic pelvic pain, dysmenorrhea, anxiety and fatigue. It can often lead to multimorbidity, defined as the presence of two or more long-term conditions. Delayed diagnosis of endometriosis is a crucial issue that leads to poor quality of life and poor clinical management. Endometriosis research faces a variety of limitations, including a lack of dedicated funding. Additionally, accessing existing electronic healthcare records can be challenging due to governance and regulatory restrictions, and missing data are a commonly identified concern in real-world studies. Given these challenges, data science techniques could provide a solution through synthetic datasets, generated from known characteristics of endometriosis, to explore the possibility of predicting multimorbidity. This study aimed to develop an exploratory machine learning model that can predict multimorbidity among women with endometriosis using real-world and synthetic data. A sample of 1012 was used from two specialist endometriosis centres in the UK. In addition, 1000 synthetic data records per centre were generated using the widely used Synthetic Data Vault’s Gaussian Copula model, based on the characteristics of the patients’ records. Three standard classification models were used: Logistic Regression (LR), Support Vector Machine (SVM) and Random Forest (RF). The average accuracies, given as “model accuracy-centre1:accuracy-centre2”, were LR 64.26%:69.04%, SVM 67.35%:68.61% and RF 58.67%:73.76% on real-world data, and LR 69.9%:72.29%, SVM 69.39%:70.13% and RF 68.88%:74.62% on synthetic data. For every model and centre, the models trained on synthetic data achieved higher average accuracy than those trained on real-world data. Our findings suggest synthetic data holds promise for conducting clinical epidemiology studies and clinical trials that could devise better precision treatments and possibly reduce the burden of multimorbidity.
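
The synthetic-data step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code and does not use the Synthetic Data Vault library; it is a standard-library approximation of the general Gaussian copula idea: convert each column to normal scores through its ranks, estimate the correlation of those scores, sample correlated normals, and map them back through each column's empirical quantiles. The two columns (`ages`, `pain_scores`), their values and all function names are hypothetical and exist only for illustration.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def to_normal_scores(column):
    """Map each value to a standard-normal score via its empirical rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, i in enumerate(order):
        scores[i] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(x, y):
    """Plain Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def cholesky(matrix):
    """Lower-triangular factor of a positive-definite matrix."""
    d = len(matrix)
    lower = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = (matrix[i][i] - s) ** 0.5
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def sample_gaussian_copula(columns, n_samples, seed=0):
    """Draw synthetic rows that keep each column's marginal distribution
    and the rank-based dependence between columns."""
    rng = random.Random(seed)
    d, n = len(columns), len(columns[0])
    scores = [to_normal_scores(c) for c in columns]
    corr = [[1.0 if i == j else pearson(scores[i], scores[j])
             for j in range(d)] for i in range(d)]
    lower = cholesky(corr)
    sorted_cols = [sorted(c) for c in columns]
    rows = []
    for _ in range(n_samples):
        eps = [rng.gauss(0.0, 1.0) for _ in range(d)]
        # Correlated normals: z = L @ eps
        z = [sum(lower[i][k] * eps[k] for k in range(i + 1)) for i in range(d)]
        row = []
        for i in range(d):
            u = STD_NORMAL.cdf(z[i])            # back to the unit interval
            idx = min(int(u * n), n - 1)        # empirical quantile lookup
            row.append(sorted_cols[i][idx])
        rows.append(row)
    return rows


# Hypothetical toy data (NOT from the study): two loosely related columns.
toy_rng = random.Random(42)
ages = [toy_rng.gauss(35, 7) for _ in range(200)]
pain_scores = [0.1 * a + toy_rng.gauss(0, 1) for a in ages]

synthetic = sample_gaussian_copula([ages, pain_scores], n_samples=1000)
syn_ages = [row[0] for row in synthetic]
syn_pain = [row[1] for row in synthetic]
print("real correlation:     ", round(pearson(ages, pain_scores), 3))
print("synthetic correlation:", round(pearson(syn_ages, syn_pain), 3))
```

Because this sketch resamples observed values through the empirical quantiles, every synthetic value already occurs in the toy data. A library implementation would typically fit parametric marginals instead, so its output would differ.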

Text
preprints202305.1337.v1 - Author's Original
Available under License Creative Commons Attribution.

More information

Submitted date: 16 May 2023
Accepted/In Press date: 18 May 2023

Identifiers

Local EPrints ID: 508584
URI: http://eprints.soton.ac.uk/id/eprint/508584
PURE UUID: 6899329c-749f-4d98-b3c6-9e3848574e7c
ORCID for Alain Zemkoho: orcid.org/0000-0003-1265-4178

Catalogue record

Date deposited: 27 Jan 2026 18:03
Last modified: 28 Jan 2026 03:37

Contributors

Author: Gayathri Delanerolle
Author: David Benfield
Author: Peter Phiri
Author: Yassine Bouchareb
Author: Kingshuk Majumder
Author: Heitor Cavalini
Author: Jian Shi
Author: Om Kurmi
Author: Ashish Shetty
Author: Dharani K. Hapanagama
Author: Alain Zemkoho
