What to do when K-means clustering fails: a simple yet principled alternative algorithm

Yordan P. Raykov; Alexis Boukouvalas; Fahd Baig; Max A. Little

doi:10.1371/journal.pone.0162259

Computer Science

Research output: Contribution to journal › Article › peer-review

52 Citations (Scopus)

134 Downloads (Pure)

Abstract

The K-means algorithm is one of the most popular clustering algorithms in current use, as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm, which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means, with MAP-DP convergence typically achieved on the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross-validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.
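The restriction that K must be fixed in advance can be illustrated with a minimal sketch of plain K-means (Lloyd's algorithm). This is not the authors' MAP-DP algorithm, which is not reproduced here; the function name `kmeans`, its parameters, and the synthetic two-blob data are all assumptions made purely for illustration.

```python
import numpy as np

def nearest_centroid(X, centroids):
    """Index of the closest centroid (squared Euclidean distance) per point."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(X, K, n_iter=100, seed=0):
    """Plain Lloyd's K-means. K is an input: it is never inferred from X."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids with K distinct data points.
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(X, centroids)
        # Update step: mean of assigned points (keep old centroid if empty).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return nearest_centroid(X, centroids), centroids

# Hypothetical data: two well-separated spherical blobs of 50 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])

labels_2, _ = kmeans(X, K=2)  # K happens to match the data
labels_4, _ = kmeans(X, K=4)  # K is too large, but K-means cannot tell
print(len(np.unique(labels_2)), len(np.unique(labels_4)))
```

With K=2 the two blobs are recovered, whereas with K=4 the algorithm still partitions the data around four centroids and gives no indication that two clusters would describe it better. The assignment step also uses only Euclidean distance to the centroid, which is one way the assumptions mentioned in the abstract enter the method.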

Original language	English
Article number	e0162259
Number of pages	28
Journal	PLoS ONE
Volume	11
Issue number	9
DOIs	https://doi.org/10.1371/journal.pone.0162259
Publication status	Published - 26 Sept 2016

Access to Document

10.1371/journal.pone.0162259
Licence: Creative Commons: Attribution (CC BY)

Raykov_et_al_What_to_do_when_K-means_clustering_fails_PLoS_ONE_2016
Checked for eligibility: 04/09/2019
Final published version, 3.45 MB
Licence: Creative Commons: Attribution (CC BY)

https://dx.plos.org/10.1371/journal.pone.0162259
Licence: Creative Commons: Attribution (CC BY)

Cite this

@article{6ba55c72b1b34cbf9250b7f764c4a331,

title = "What to do when K-means clustering fails: a simple yet principled alternative algorithm",

abstract = "The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.",

author = "Raykov, Yordan P. and Boukouvalas, Alexis and Baig, Fahd and Little, Max A.",

year = "2016",

month = sep,

day = "26",

doi = "10.1371/journal.pone.0162259",

language = "English",

volume = "11",

journal = "PLoS ONE",

issn = "1932-6203",

publisher = "Public Library of Science (PLOS)",

number = "9",

}

TY - JOUR

T1 - What to do when K-means clustering fails

T2 - a simple yet principled alternative algorithm

AU - Raykov, Yordan P.

AU - Boukouvalas, Alexis

AU - Baig, Fahd

AU - Little, Max A.

PY - 2016/9/26

Y1 - 2016/9/26

AB - The K-means algorithm is one of the most popular clustering algorithms in current use as it is relatively fast yet simple to understand and deploy in practice. Nevertheless, its use entails certain restrictive assumptions about the data, the negative consequences of which are not always immediately apparent, as we demonstrate. While more flexible algorithms have been developed, their widespread use has been hindered by their computational and technical complexity. Motivated by these considerations, we present a flexible alternative to K-means that relaxes most of the assumptions, whilst remaining almost as fast and simple. This novel algorithm which we call MAP-DP (maximum a-posteriori Dirichlet process mixtures), is statistically rigorous as it is based on nonparametric Bayesian Dirichlet process mixture modeling. This approach allows us to overcome most of the limitations imposed by K-means. The number of clusters K is estimated from the data instead of being fixed a-priori as in K-means. In addition, while K-means is restricted to continuous data, the MAP-DP framework can be applied to many kinds of data, for example, binary, count or ordinal data. Also, it can efficiently separate outliers from the data. This additional flexibility does not incur a significant computational overhead compared to K-means with MAP-DP convergence typically achieved in the order of seconds for many practical problems. Finally, in contrast to K-means, since the algorithm is based on an underlying statistical model, the MAP-DP framework can deal with missing data and enables model testing such as cross validation in a principled way. We demonstrate the simplicity and effectiveness of this algorithm on the health informatics problem of clinical sub-typing in a cluster of diseases known as parkinsonism.

U2 - 10.1371/journal.pone.0162259

DO - 10.1371/journal.pone.0162259

M3 - Article

SN - 1932-6203

VL - 11

JO - PLoS ONE

JF - PLoS ONE

IS - 9

M1 - e0162259

ER -
