TY - GEN
T1 - Online density estimation of heterogeneous data streams in higher dimensions
AU - Geilke, Michael
AU - Karwath, Andreas
AU - Kramer, Stefan
PY - 2016/9/4
Y1 - 2016/9/4
AB - The joint density of a data stream is suitable for performing data mining tasks without having access to the original data. However, the methods proposed so far only target a small to medium number of variables, since their estimates rely on representing all the interdependencies between the variables of the data. High-dimensional data streams, which are becoming more and more frequent due to increasing numbers of interconnected devices, are, therefore, pushing these methods to their limits. To mitigate these limitations, we present an approach that projects the original data stream into a vector space and uses a set of representatives to provide an estimate. Due to the structure of the estimates, it enables the density estimation of higher-dimensional data and approaches the true density with increasing dimensionality of the vector space. Moreover, it is not only designed to estimate homogeneous data, i.e., where all variables are nominal or all variables are numeric, but it can also estimate heterogeneous data. The evaluation is conducted on synthetic and real-world data. The software related to this paper is available at https://github.com/geilke/mideo.
KW - data mining
KW - density estimation
KW - stream mining
U2 - 10.1007/978-3-319-46128-1_5
DO - 10.1007/978-3-319-46128-1_5
M3 - Conference contribution
SN - 9783319461274
T3 - Lecture Notes in Computer Science
SP - 65
EP - 80
BT - Machine Learning and Knowledge Discovery in Databases
PB - Springer
T2 - European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases
Y2 - 19 September 2016 through 23 September 2016
ER -