Text categorization using word similarities based on higher order co-occurrences

Syed Fawad Hussain, Gilles Bisson

In this paper, we propose an extension of the χ-Sim co-clustering algorithm to handle the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words, and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words, and two words are similar if they occur in similar documents. The algorithm has been shown to work well for unsupervised document clustering. By introducing some a priori knowledge about the class labels of documents in the initialization step of χ-Sim, we extend the method to the supervised task. The proposed approach is tested on several classical textual datasets, and our experiments show that the proposed algorithm compares favorably with, or surpasses, both traditional and state-of-the-art algorithms such as k-NN, supervised LSI, and SVM.
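The mutual definition of document and word similarity can be illustrated with a minimal sketch. It covers only the unsupervised alternating update described above; the class-label initialization is not shown because the abstract does not specify it. The function names, the normalization by the product of row sums, and the choice to fix self-similarity at 1 are assumptions made for this illustration and may differ from the formulation in [1].

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def row_similarity(m, s_cols):
    """Similarity between the rows of m, given a similarity matrix over its columns.

    Assumption: each raw score is divided by the product of the two row sums,
    and self-similarity is fixed at 1.
    """
    raw = matmul(matmul(m, s_cols), transpose(m))
    sums = [sum(row) for row in m]
    n = len(m)
    sim = identity(n)
    for i in range(n):
        for j in range(n):
            if i != j and sums[i] > 0 and sums[j] > 0:
                sim[i][j] = raw[i][j] / (sums[i] * sums[j])
    return sim


def co_similarity(doc_word, iterations):
    """Alternately update document and word similarities, starting from identity."""
    word_doc = transpose(doc_word)
    s_docs = identity(len(doc_word))
    s_words = identity(len(word_doc))
    for _ in range(iterations):
        # Both updates use the similarity matrices from the previous iteration.
        new_docs = row_similarity(doc_word, s_words)
        new_words = row_similarity(word_doc, s_docs)
        s_docs, s_words = new_docs, new_words
    return s_docs, s_words


# Toy document-by-word count matrix: documents 0 and 2 share no word,
# but each shares one word with document 1.
doc_word = [
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
]

docs_after_one, _ = co_similarity(doc_word, iterations=1)
docs_after_two, _ = co_similarity(doc_word, iterations=2)
print(docs_after_one[0][2], docs_after_two[0][2])
```

In this toy matrix, documents 0 and 2 have zero similarity after the first iteration because they have no word in common. After the second iteration their similarity becomes positive, because words 1 and 2 have themselves become similar through co-occurring in document 1. This is the higher-order co-occurrence effect the title refers to.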

Published - May 2010
10th SIAM International Conference on Data Mining, SDM 2010 - Columbus, OH, United States
29 Apr 2010 – 1 May 2010




  • Clustering
  • Higher-order co-occurrences
  • Supervised learning
  • Text categorization

