Abstract
Beyer et al. gave a sufficient condition for the high dimensional phenomenon known as the concentration of distances. Their work has pinpointed serious problems due to nearest neighbours not being meaningful in high dimensions. Here we establish the converse of their result, in order to answer the question as to when nearest neighbour is still meaningful in arbitrarily high dimensions. We then show for a class of realistic data distributions having non-i.i.d. dimensions, namely the family of linear latent variable models, that the Euclidean distance will not concentrate as long as the number of 'relevant' dimensions grows no slower than the overall data dimensions. This condition is, of course, often not met in practice. After numerically validating our findings, we examine real data situations in two different areas (text-based document collections and gene expression arrays), which suggest that the presence or absence of distance concentration in high dimensional problems plays a role in making the data hard or easy to work with. (C) 2009 Elsevier Inc. All rights reserved.
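The contrast between the two regimes described above can be illustrated numerically. The sketch below is not the paper's actual experiments: the relative-contrast measure follows Beyer et al.'s (D_max - D_min)/D_min, but the Gaussian data, the sample sizes, and the latent dimensionality k=5 in the noiseless linear latent variable example are illustrative assumptions.

```python
import math
import random

def relative_contrast(points, query):
    """Beyer et al.'s relative contrast (D_max - D_min) / D_min
    over Euclidean distances from a query to a point set."""
    dists = [math.dist(p, query) for p in points]
    return (max(dists) - min(dists)) / min(dists)

def iid_contrast(dim, n=300, seed=0):
    """All `dim` coordinates i.i.d. standard normal (the setting where
    distances concentrate): contrast shrinks as dim grows."""
    rng = random.Random(seed)
    pts = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
    query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    return relative_contrast(pts, query)

def latent_contrast(dim, k=5, n=300, seed=0):
    """Noiseless linear latent variable sketch: points x = Wz lie in a
    k-dimensional subspace of R^dim, so the contrast stays bounded
    away from zero no matter how large dim gets."""
    rng = random.Random(seed)
    # Random mixing matrix W (dim x k); columns have norm ~ 1.
    W = [[rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(k)]
         for _ in range(dim)]
    def embed(z):
        return [sum(row[j] * z[j] for j in range(k)) for row in W]
    pts = [embed([rng.gauss(0.0, 1.0) for _ in range(k)]) for _ in range(n)]
    query = embed([rng.gauss(0.0, 1.0) for _ in range(k)])
    return relative_contrast(pts, query)

if __name__ == "__main__":
    for d in (2, 200, 2000):
        print(f"i.i.d. dims, dim={d:>4}: contrast = {iid_contrast(d):.3f}")
    for d in (200, 2000):
        print(f"latent model, dim={d:>4}: contrast = {latent_contrast(d):.3f}")
```

Running this shows the i.i.d. contrast collapsing towards zero with growing dimensionality, while the latent-model contrast remains of the same order as that of k-dimensional data, in the spirit of the paper's non-concentration result.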
Original language | English |
---|---|
Pages (from-to) | 385-397 |
Number of pages | 13 |
Journal | Journal of Complexity |
Volume | 25 |
Issue number | 4 |
DOIs | |
Publication status | Published - 1 Aug 2009 |
Keywords
- High dimensionality
- Distance concentration
- Latent variable models
Fingerprint
Dive into the research topics of 'When is 'nearest neighbour' meaningful: A converse theorem and implications'. Together they form a unique fingerprint.
Projects
- 1 Finished
- Discipline Hopping Award: Generative-discriminative Hybrids for Disease Prediction & Cell Communication Modelling
22/09/08 → 21/09/09
Project: Research Councils