Text categorization using word similarities based on higher order co-occurrences

Syed Fawad Hussain; Gilles Bisson

doi:10.1137/1.9781611972801.1

Computer Science

Research output: Contribution to conference (unpublished) › Paper › peer-review

17 Citations (Scopus)

Abstract

In this paper, we propose an extension of the χ-Sim co-clustering algorithm to deal with the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words, and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words, and two words are similar if they occur in similar documents. The algorithm has been shown to work well for unsupervised document clustering. By introducing some "a priori" knowledge about the class labels of documents in the initialization step of χ-Sim, we are able to extend the method to handle the supervised task. The proposed approach is tested on different classical textual datasets, and our experiments show that the proposed algorithm compares favorably with, or surpasses, both traditional and state-of-the-art algorithms such as k-NN, supervised LSI, and SVM.
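The mutual-reinforcement idea in the abstract (documents are similar when they share similar words, and words are similar when they occur in similar documents) can be illustrated with a small sketch. This is not the authors' exact χ-Sim formulation: the function names `co_similarity` and `_normalize`, the cosine-style normalization, the identity initialization, and the toy matrix `M` are all assumptions made for illustration, and the supervised initialization with class labels is not shown because the abstract does not describe it.

```python
import numpy as np


def _normalize(S):
    # Cosine-style scaling so that self-similarity is 1.
    # This normalization is an illustrative assumption, not the paper's formula.
    d = np.sqrt(np.diag(S))
    d[d == 0] = 1.0  # avoid division by zero for empty rows
    return S / np.outer(d, d)


def co_similarity(M, n_iter=2):
    """Alternately refine document and word similarity matrices.

    M is a (documents x words) count matrix. Both similarity matrices
    start as the identity (each item is similar only to itself).
    """
    M = np.asarray(M, dtype=float)
    n_docs, n_words = M.shape
    S_doc = np.eye(n_docs)
    S_word = np.eye(n_words)
    for _ in range(n_iter):
        # Documents are similar if they contain similar words.
        new_doc = _normalize(M @ S_word @ M.T)
        # Words are similar if they occur in similar documents.
        new_word = _normalize(M.T @ S_doc @ M)
        S_doc, S_word = new_doc, new_word
    return S_doc, S_word


# Toy corpus (hypothetical): document 0 and document 2 share no words,
# but both overlap with document 1.
M = np.array([
    [1, 1, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 1],
])

S_first, _ = co_similarity(M, n_iter=1)
S_second, _ = co_similarity(M, n_iter=2)
print(S_first[0, 2], S_second[0, 2])
```

After one iteration, documents 0 and 2 have zero similarity because they share no words. After the second iteration their similarity becomes positive, since the words they contain co-occur in document 1. This is the kind of higher-order co-occurrence that the title refers to.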

Original language	English
Pages	1-12
Number of pages	12
DOIs	https://doi.org/10.1137/1.9781611972801.1
Publication status	Published - May 2010
Event	10th SIAM International Conference on Data Mining, SDM 2010 - Columbus, OH, United States Duration: 29 Apr 2010 → 1 May 2010

Conference

Conference	10th SIAM International Conference on Data Mining, SDM 2010
Country/Territory	United States
City	Columbus, OH
Period	29/04/10 → 1/05/10

Keywords

Clustering
Higher-order co-occurrences
Supervised learning
Text categorization

ASJC Scopus subject areas

Software

Cite this

@conference{62b30e92a4364906a80b13cabb8a7dc0,
title = "Text categorization using word similarities based on higher order co-occurrences",
abstract = "In this paper, we propose an extension of the χ-Sim co-clustering algorithm to deal with the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words, and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words, and two words are similar if they occur in similar documents. The algorithm has been shown to work well for unsupervised document clustering. By introducing some {"}a priori{"} knowledge about the class labels of documents in the initialization step of χ-Sim, we are able to extend the method to handle the supervised task. The proposed approach is tested on different classical textual datasets, and our experiments show that the proposed algorithm compares favorably with, or surpasses, both traditional and state-of-the-art algorithms such as k-NN, supervised LSI, and SVM.",
keywords = "Clustering, Higher-order co-occurrences, Supervised learning, Text categorization",
author = "Hussain, {Syed Fawad} and Bisson, Gilles",
year = "2010",
month = may,
doi = "10.1137/1.9781611972801.1",
language = "English",
pages = "1--12",
note = "10th SIAM International Conference on Data Mining, SDM 2010 ; Conference date: 29-04-2010 Through 01-05-2010",
}

TY - CONF
T1 - Text categorization using word similarities based on higher order co-occurrences
AU - Hussain, Syed Fawad
AU - Bisson, Gilles
PY - 2010/5
Y1 - 2010/5
AB - In this paper, we propose an extension of the χ-Sim co-clustering algorithm to deal with the text categorization task. The idea behind the χ-Sim method [1] is to iteratively learn the similarity matrix between documents using the similarity matrix between words, and vice versa. Thus, two documents are said to be similar if they share similar (but not necessarily identical) words, and two words are similar if they occur in similar documents. The algorithm has been shown to work well for unsupervised document clustering. By introducing some "a priori" knowledge about the class labels of documents in the initialization step of χ-Sim, we are able to extend the method to handle the supervised task. The proposed approach is tested on different classical textual datasets, and our experiments show that the proposed algorithm compares favorably with, or surpasses, both traditional and state-of-the-art algorithms such as k-NN, supervised LSI, and SVM.
KW - Clustering
KW - Higher-order co-occurrences
KW - Supervised learning
KW - Text categorization
UR - http://www.scopus.com/inward/record.url?scp=84880126925&partnerID=8YFLogxK
U2 - 10.1137/1.9781611972801.1
DO - 10.1137/1.9781611972801.1
M3 - Paper
AN - SCOPUS:84880126925
SP - 1
EP - 12
T2 - 10th SIAM International Conference on Data Mining, SDM 2010
Y2 - 29 April 2010 through 1 May 2010
ER -
