Abstract
Background: Endometriosis is a complex health condition with an array of physical and psychological symptoms, often leading to multimorbidity. Multimorbidity is the co-existence of two or more chronic medical conditions in one individual, with no single condition considered the index condition; it could therefore be prevented if the initial conditions are managed effectively. It is a remarkably challenging health condition, and a good understanding of the complex mechanisms involved could enable timely diagnosis and effective management plans. This study aimed to develop an exploratory machine learning model that can predict multimorbidity among women with endometriosis using real-world and synthetic data.
Methods: A sample of 1012 patient records from two specialized endometriosis centers in the UK was used. The records covered a large spectrum of variables, such as patient demographics, symptoms, diseases, previous treatments, and conditions in women with a confirmed diagnosis of endometriosis. In addition, 1000 synthetic data records were generated for each center using the widely used Synthetic Data Vault’s Gaussian Copula model, based on the data characteristics of the patients’ records. Three standard classification models, Logistic Regression (LR), Support Vector Machine (SVM), and Random Forest (RF), were chosen for their intrinsic behavior in separating/classifying data, and their performance was compared on real-world and synthetic data. All models were trained on both synthetic and real-world data but tested using real-world data. Their performance was assessed using a quality assessment test, heatmaps, and average accuracies.
Results: The quality assessment test and heatmaps comparing the synthetic and real-world datasets show that the synthetic data follow the same distribution. The average accuracies of the three models (LR, SVM, and RF), given as “model accuracy-centre1:accuracy-centre2”, were LR 64.26%:69.04%, SVM 67.35%:68.61%, and RF 58.67%:73.76% on real-world data, and LR 69.9%:72.29%, SVM 69.39%:70.13%, and RF 68.88%:74.62% on synthetic data.
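The gap between the two training regimes can be read off the reported figures directly. The short sketch below only restates the accuracies given above and subtracts them per model and per center; the variable names are illustrative and nothing here is new data.

```python
# Reported average accuracies (%), as (centre 1, centre 2)
real_world = {"LR": (64.26, 69.04), "SVM": (67.35, 68.61), "RF": (58.67, 73.76)}
synthetic = {"LR": (69.90, 72.29), "SVM": (69.39, 70.13), "RF": (68.88, 74.62)}

# Percentage-point gain of synthetic over real-world training, per centre
gains = {
    model: tuple(round(s - r, 2) for s, r in zip(synthetic[model], real_world[model]))
    for model in real_world
}

for model, (c1, c2) in gains.items():
    print(f"{model}: centre 1 {c1:+.2f} pp, centre 2 {c2:+.2f} pp")
```

All six differences are positive, with the largest gain for RF at centre 1 and the smallest for RF at centre 2.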
Conclusion: The findings of this exploratory study show that machine learning models trained on synthetic data performed better than models trained on real-world data. This suggests that synthetic data show much promise for conducting clinical epidemiology studies and clinical trials that could devise better precision treatments for endometriosis and possibly prevent multimorbidity.
| Original language | English |
|---|---|
| Pages (from-to) | 655-670 |
| Journal | American Journal of Biomedical Science & Research |
| Volume | 22 |
| Issue number | 5 |
| DOIs | |
| Publication status | Published - 28 May 2024 |