TY - JOUR
T1 - Comparison of machine learning approaches with a general linear model to predict personal exposure to benzene
AU - Aquilina, Noel J.
AU - Delgado Saborit, Juana Maria
AU - Bugelli, Stefano
AU - Padovani Ginies, Jason
AU - Harrison, Roy
PY - 2018/8/31
Y1 - 2018/8/31
N2 - Machine Learning Techniques (MLTs) offer great power in analysing complex datasets and have not previously been applied to non-occupational pollutant exposure. MLT models that can predict personal exposure to benzene have been developed and compared with a standard model using a linear regression approach (GLM). The models were tested against independent datasets obtained from three personal exposure measurement campaigns. A Correlation-based Feature Subset (CFS) selection algorithm identified a reduced attribute set, with common attributes grouped under the use of paints in homes, upholstery materials, space heating, and environmental tobacco smoke, as suitable for predicting personal exposure to benzene. Personal exposure was categorised as low, medium and high. For big datasets, both the GLM and the MLTs show high variability in their ability to correctly classify concentrations above the 90th percentile, but the MLT models score higher when the divergence of incorrectly classified cases is taken into account. Overall, the MLTs perform at least as well as the GLM and avoid the need to input microenvironment concentrations.
AB - Machine Learning Techniques (MLTs) offer great power in analysing complex datasets and have not previously been applied to non-occupational pollutant exposure. MLT models that can predict personal exposure to benzene have been developed and compared with a standard model using a linear regression approach (GLM). The models were tested against independent datasets obtained from three personal exposure measurement campaigns. A Correlation-based Feature Subset (CFS) selection algorithm identified a reduced attribute set, with common attributes grouped under the use of paints in homes, upholstery materials, space heating, and environmental tobacco smoke, as suitable for predicting personal exposure to benzene. Personal exposure was categorised as low, medium and high. For big datasets, both the GLM and the MLTs show high variability in their ability to correctly classify concentrations above the 90th percentile, but the MLT models score higher when the divergence of incorrectly classified cases is taken into account. Overall, the MLTs perform at least as well as the GLM and avoid the need to input microenvironment concentrations.
KW - Benzene
KW - personal exposure
KW - machine learning techniques
KW - general linear model
KW - dimension reduction
U2 - 10.1021/acs.est.8b03328
DO - 10.1021/acs.est.8b03328
M3 - Article
SN - 0013-936X
JO - Environmental Science and Technology
JF - Environmental Science and Technology
ER -