Abstract
In this paper, we propose an extension of the χ-Sim co-clustering algorithm to handle the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words, and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words, and two words are similar if they occur in similar documents. The algorithm has been shown to work well for unsupervised document clustering. By introducing some a priori knowledge about the class labels of documents in the initialization step of χ-Sim, we are able to extend the method to the supervised task. The proposed approach is tested on several classical text datasets, and our experiments show that it compares favorably with or surpasses both traditional and state-of-the-art algorithms such as k-NN, supervised LSI and SVM.
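The alternating update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the normalization by row totals, the unit diagonal, and the optional `doc_sim` seed (a hypothetical hook for label-derived prior knowledge) are all assumptions made for illustration only.

```python
def identity(n):
    """Return an n x n identity matrix as nested lists."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]


def transpose(a):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_similarity(m, s):
    """Similarity between rows of m, weighted by the similarity s of its columns.

    sim[i][j] = (m_i . s . m_j) / (sum(m_i) * sum(m_j)); the diagonal is set to 1.
    The normalization is an assumption of this sketch.
    """
    raw = matmul(matmul(m, s), transpose(m))
    totals = [sum(row) for row in m]
    n = len(m)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                sim[i][j] = 1.0
            elif totals[i] > 0 and totals[j] > 0:
                sim[i][j] = raw[i][j] / (totals[i] * totals[j])
    return sim


def chi_sim_sketch(doc_word, iterations=3, doc_sim=None):
    """Alternately re-estimate document and word similarity matrices.

    doc_word: documents x words count matrix (nested lists).
    doc_sim:  optional initial document similarity (hypothetical seed, e.g.
              derived from class labels); defaults to the identity.
    """
    word_doc = transpose(doc_word)
    word_sim = identity(len(word_doc))
    if doc_sim is None:
        doc_sim = identity(len(doc_word))
    for _ in range(iterations):
        # Both updates use the matrices from the previous iteration.
        new_doc_sim = normalized_similarity(doc_word, word_sim)
        new_word_sim = normalized_similarity(word_doc, doc_sim)
        doc_sim, word_sim = new_doc_sim, new_word_sim
    return doc_sim, word_sim
```

For a toy matrix `[[1, 1, 0, 0], [0, 1, 1, 0], [0, 0, 1, 1]]`, the first and third documents share no words, so their similarity is 0 after one iteration. After a second iteration it becomes positive, because the words they contain co-occur in the second document. This is the "similar but not necessarily identical words" effect the abstract describes.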
Original language | English |
---|---|
Pages | 1-12 |
Number of pages | 12 |
Publication status | Published - May 2010 |
Event | 10th SIAM International Conference on Data Mining, SDM 2010 - Columbus, OH, United States. Duration: 29 Apr 2010 → 1 May 2010 |
Conference
Conference | 10th SIAM International Conference on Data Mining, SDM 2010 |
---|---|
Country/Territory | United States |
City | Columbus, OH |
Period | 29/04/10 → 1/05/10 |
Keywords
- Clustering
- Higher-order co-occurrences
- Supervised learning
- Text categorization
ASJC Scopus subject areas
- Software