When is 'nearest neighbour' meaningful: A converse theorem and implications

Robert Durrant, Ata Kaban

Research output: Contribution to journal › Article

54 Citations (Scopus)

Abstract

Beyer et al. gave a sufficient condition for the high dimensional phenomenon known as the concentration of distances. Their work has pinpointed serious problems due to nearest neighbours not being meaningful in high dimensions. Here we establish the converse of their result, in order to answer the question as to when nearest neighbour is still meaningful in arbitrarily high dimensions. We then show, for a class of realistic data distributions having non-i.i.d. dimensions, namely the family of linear latent variable models, that the Euclidean distance will not concentrate as long as the number of 'relevant' dimensions grows no slower than the overall data dimensions. This condition is, of course, often not met in practice. After numerically validating our findings, we examine real data situations in two different areas (text-based document collections and gene expression arrays), which suggest that the presence or absence of distance concentration in high dimensional problems plays a role in making the data hard or easy to work with. © 2009 Elsevier Inc. All rights reserved.
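To make the abstract's dichotomy concrete, here is a minimal numerical sketch (ours, not code from the paper) comparing the relative contrast (DMAX - DMIN)/DMIN of Euclidean distances from a random query point as the data dimension d grows. Case (a) uses i.i.d. dimensions, where Beyer et al.'s condition applies and the contrast should vanish; case (b) uses a toy linear latent variable model in which every observed dimension is driven by k latent factors, so the number of 'relevant' dimensions grows with d and the contrast should stay bounded away from zero. The Gaussian distributions, sample size and loading scale are illustrative assumptions, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(X, q):
    # (DMAX - DMIN) / DMIN of Euclidean distances from query q to the rows of X.
    dists = np.linalg.norm(X - q, axis=1)
    return (dists.max() - dists.min()) / dists.min()

n, k = 500, 5  # sample size and number of latent factors (illustrative choices)
for d in (10, 100, 1000, 10000):
    # (a) i.i.d. dimensions: every coordinate is independent noise.
    X_iid = rng.standard_normal((n, d))
    q_iid = rng.standard_normal(d)

    # (b) linear latent variable model x = Lambda z + noise, with a d x k
    # loading matrix Lambda whose rows are all nonzero, so all d observed
    # dimensions carry signal from the k latent factors.
    Lambda = rng.standard_normal((d, k)) / np.sqrt(k)
    Z = rng.standard_normal((n, k))
    X_lvm = Z @ Lambda.T + rng.standard_normal((n, d))
    q_lvm = rng.standard_normal(k) @ Lambda.T + rng.standard_normal(d)

    print(f"d={d:6d}   iid contrast={relative_contrast(X_iid, q_iid):.3f}"
          f"   lvm contrast={relative_contrast(X_lvm, q_lvm):.3f}")

As d increases, the contrast in case (a) should shrink towards zero (distances concentrate and nearest neighbour becomes unstable), while in case (b) it should remain roughly constant, in line with the non-concentration result stated in the abstract.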
Original language: English
Pages (from-to): 385-397
Number of pages: 13
Journal: Journal of Complexity
Volume: 25
Issue number: 4
Publication status: Published - 1 Aug 2009

Keywords

  • High dimensionality
  • Distance concentration
  • Latent variable models
