Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach

Matthew McTeer; Robin Henderson; Quentin M. Anstee; Paolo Missier

doi:10.3390/math12050777

Corresponding author: Matthew McTeer

Computer Science

Research output: Contribution to journal › Article › peer-review

Abstract

Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger.

Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort.

Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods.

Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation.
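
The abstract does not state the form of either penalty, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' method. It assumes a single shared covariate, a standard difference penalty on the B-spline coefficients as the first penalty and, as a stand-in for the second, a squared discrepancy between the small-cohort fit and a marginal curve `m_grid` presumed to have been estimated from the larger cohort. All names (`bspline_basis`, `fit_twice_penalized`, `lam1`, `lam2`, `m_grid`) are hypothetical, and the data are simulated.

```python
import numpy as np


def bspline_basis(x, lo=0.0, hi=1.0, n_seg=10, degree=3):
    """B-spline basis on equally spaced knots (Cox-de Boor recursion)."""
    x = np.asarray(x, dtype=float)
    h = (hi - lo) / n_seg
    # Knots extend `degree` steps beyond [lo, hi] so every x has full support.
    knots = lo + h * np.arange(-degree, n_seg + degree + 1)
    # Degree 0: indicator of the knot interval containing each x.
    B = ((knots[:-1] <= x[:, None]) & (x[:, None] < knots[1:])).astype(float)
    for d in range(1, degree + 1):
        left = (x[:, None] - knots[:-d - 1]) / (d * h)
        right = (knots[d + 1:] - x[:, None]) / (d * h)
        B = left * B[:, :-1] + right * B[:, 1:]
    return B  # shape: (len(x), n_seg + degree)


def fit_twice_penalized(x, y, grid, m_grid, lam1, lam2, diff_order=2):
    """Minimise ||y - B t||^2 + lam1 ||D t||^2 + lam2 ||G t - m_grid||^2.

    The third term is an ASSUMED form of the second penalty: it pulls the
    small-cohort fit toward a curve m_grid taken from the larger cohort.
    """
    B = bspline_basis(x)       # basis at the small-cohort observations
    G = bspline_basis(grid)    # basis at the comparison grid
    # Difference matrix acting on adjacent spline coefficients.
    D = np.diff(np.eye(B.shape[1]), n=diff_order, axis=0)
    lhs = B.T @ B + lam1 * D.T @ D + lam2 * G.T @ G
    rhs = B.T @ y + lam2 * G.T @ m_grid
    return np.linalg.solve(lhs, rhs)


# Simulated small cohort and a placeholder large-cohort marginal curve.
rng = np.random.default_rng(0)
x_small = rng.uniform(0.0, 1.0, 30)
y_small = np.sin(2 * np.pi * x_small) + rng.normal(0.0, 0.3, 30)
grid = np.linspace(0.0, 1.0, 50)
m_grid = np.sin(2 * np.pi * grid)  # hypothetical large-cohort estimate

theta_once = fit_twice_penalized(x_small, y_small, grid, m_grid, lam1=1.0, lam2=0.0)
theta_twice = fit_twice_penalized(x_small, y_small, grid, m_grid, lam1=1.0, lam2=10.0)
```

In this sketch, setting `lam2 = 0` reduces the fit to an ordinary once penalized P-spline, while larger values pull the fitted curve toward the large-cohort curve. How the two penalty parameters are tuned is not specified here.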

Original language	English
Article number	777
Number of pages	33
Journal	Mathematics
Volume	12
Issue number	5
DOIs	https://doi.org/10.3390/math12050777
Publication status	Published - 5 Mar 2024

Bibliographical note

Funding:
This work was supported by Newcastle University and Red Hat UK. This work has been supported by the LITMUS project, which has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No. 777377. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA. QMA is an NIHR Senior Investigator and is supported by the Newcastle NIHR Biomedical Research Centre. This communication reflects the view of the authors and neither IMI nor the European Union and EFPIA are liable for any use that may be made of the information contained herein.

Keywords

P-Spline
penalized regression
smoothing
asymmetric data
B-Spline
non-Parametric
MASLD
MASH
health data science

Access to Document

10.3390/math12050777, Licence: Creative Commons: Attribution (CC BY)

McTeerM2024Handling, Final published version, 2.03 MB, Licence: Creative Commons: Attribution (CC BY)

H2020_COLLAB (IMI)_LITMUS
Newsome, P.
European Commission
1/11/17 → 29/02/24
Project: EU

Cite this

@article{0ac86acbdd1041c6ad01dd059e96ce78,

title = "Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach",

abstract = "Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58\% and 46\% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual{\textquoteright}s risk of developing steatohepatitis, we report an over 65\% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation.",

keywords = "P-Spline, penalized regression, smoothing, asymmetric data, B-Spline, non-Parametric, MASLD, MASH, health data science",

author = "Matthew McTeer and Robin Henderson and Anstee, {Quentin M.} and Paolo Missier",

note = "Funding: This work was supported by Newcastle University and Red Hat UK. This work has been supported by the LITMUS project, which has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No. 777377. This Joint Undertaking receives support from the European Union{\textquoteright}s Horizon 2020 research and innovation programme and EFPIA. QMA is an NIHR Senior Investigator and is supported by the Newcastle NIHR Biomedical Research Centre. This communication reflects the view of the authors and neither IMI nor the European Union and EFPIA are liable for any use that may be made of the information contained herein.",

year = "2024",

month = mar,

day = "5",

doi = "10.3390/math12050777",

language = "English",

volume = "12",

journal = "Mathematics",

issn = "2227-7390",

publisher = "MDPI",

number = "5",

}

TY - JOUR

T1 - Handling Overlapping Asymmetric Data Sets—A Twice Penalized P-Spline Approach

AU - McTeer, Matthew

AU - Henderson, Robin

AU - Anstee, Quentin M.

AU - Missier, Paolo

N1 - Funding: This work was supported by Newcastle University and Red Hat UK. This work has been supported by the LITMUS project, which has received funding from the Innovative Medicines Initiative 2 Joint Undertaking under grant agreement No. 777377. This Joint Undertaking receives support from the European Union’s Horizon 2020 research and innovation programme and EFPIA. QMA is an NIHR Senior Investigator and is supported by the Newcastle NIHR Biomedical Research Centre. This communication reflects the view of the authors and neither IMI nor the European Union and EFPIA are liable for any use that may be made of the information contained herein.

PY - 2024/3/5

Y1 - 2024/3/5

N2 - Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation.

AB - Aims: Overlapping asymmetric data sets are where a large cohort of observations have a small amount of information recorded, and within this group there exists a smaller cohort which have extensive further information available. Missing imputation is unwise if cohort size differs substantially; therefore, we aim to develop a way of modelling the smaller cohort whilst considering the larger. Methods: Through considering traditionally once penalized P-Spline approximations, we create a second penalty term through observing discrepancies in the marginal value of covariates that exist in both cohorts. Our now twice penalized P-Spline is designed to firstly prevent over/under-fitting of the smaller cohort and secondly to consider the larger cohort. Results: Through a series of data simulations, penalty parameter tunings, and model adaptations, our twice penalized model offers up to a 58% and 46% improvement in model fit upon a continuous and binary response, respectively, against existing B-Spline and once penalized P-Spline methods. Applying our model to an individual’s risk of developing steatohepatitis, we report an over 65% improvement over existing methods. Conclusions: We propose a twice penalized P-Spline method which can vastly improve the model fit of overlapping asymmetric data sets upon a common predictive endpoint, without the need for missing data imputation.

KW - P-Spline

KW - penalized regression

KW - smoothing

KW - asymmetric data

KW - B-Spline

KW - non-Parametric

KW - MASLD

KW - MASH

KW - health data science

U2 - 10.3390/math12050777

DO - 10.3390/math12050777

M3 - Article

SN - 2227-7390

VL - 12

JO - Mathematics

JF - Mathematics

IS - 5

M1 - 777

ER -
