Text categorization using word similarities based on higher order co-occurrences

Syed Fawad Hussain, Gilles Bisson

Research output: Contribution to conference (unpublished) › Paper › peer-review

17 Citations (Scopus)

Abstract

In this paper, we propose an extension of the χ-Sim co-clustering algorithm to deal with the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words, and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words, and two words are similar if they occur in similar documents. The algorithm has been shown to work well for unsupervised document clustering. By introducing some "a priori" knowledge about the class labels of documents in the initialization step of χ-Sim, we are able to extend the method to deal with the supervised task. The proposed approach is tested on different classical textual datasets, and our experiments show that the proposed algorithm compares favorably with or surpasses both traditional and state-of-the-art algorithms such as k-NN, supervised LSI and SVM.
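The mutual reinforcement described in the abstract (documents are similar if they share similar words, words are similar if they occur in similar documents) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the row-sum normalization, and the resetting of the diagonal to 1 are assumptions made for illustration, and the class-label initialization used for the supervised task is not sketched because the abstract does not specify it.

```python
def identity(n):
    """n x n identity matrix as a list of rows."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def normalized_similarity(data, other_sim):
    """Similarity between the rows of `data`, weighted by `other_sim`.

    Entry (i, j) is data[i] . other_sim . data[j] divided by the product
    of the two row sums (an assumed normalization). The diagonal is set to 1.
    """
    raw = matmul(matmul(data, other_sim), transpose(data))
    sums = [sum(row) for row in data]
    n = len(data)
    sim = identity(n)
    for i in range(n):
        for j in range(n):
            if i != j and sums[i] and sums[j]:
                sim[i][j] = raw[i][j] / (sums[i] * sums[j])
    return sim


def mutual_similarity(doc_word, iterations=3):
    """Alternately update document and word similarity matrices.

    `doc_word` is a document-by-word count matrix. Both similarity
    matrices start as identity (only identical items are similar).
    """
    word_doc = transpose(doc_word)
    doc_sim = identity(len(doc_word))
    word_sim = identity(len(word_doc))
    for _ in range(iterations):
        # Both updates use the matrices from the previous iteration.
        new_doc_sim = normalized_similarity(doc_word, word_sim)
        new_word_sim = normalized_similarity(word_doc, doc_sim)
        doc_sim, word_sim = new_doc_sim, new_word_sim
    return doc_sim, word_sim


# Toy corpus over four words: documents 0 and 2 share no word,
# but each shares one word with document 1.
toy_docs = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]
doc_sim, word_sim = mutual_similarity(toy_docs, iterations=2)
print(doc_sim[0][2])
```

In this toy corpus, documents 0 and 2 have zero similarity after one iteration because they share no words. After the second iteration their similarity becomes positive, since the words they contain were found similar through document 1. This is the higher-order co-occurrence effect named in the title.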

Original language: English
Pages: 1-12
Number of pages: 12
Publication status: Published - May 2010
Event: 10th SIAM International Conference on Data Mining, SDM 2010 - Columbus, OH, United States
Duration: 29 Apr 2010 to 1 May 2010

Conference

Conference: 10th SIAM International Conference on Data Mining, SDM 2010
Country/Territory: United States
City: Columbus, OH
Period: 29/04/10 to 1/05/10

Keywords

  • Clustering
  • Higher-order co-occurrences
  • Supervised learning
  • Text categorization

ASJC Scopus subject areas

  • Software
