Extracting Semantic Representations From Word Co-Occurrence Statistics: Stop-Lists, Stemming, and SVD

John Bullinaria; Joseph P Levy

doi:10.3758/s13428-011-0183-8

Computer Science

Research output: Contribution to journal › Article

146 Citations (Scopus)

Abstract

In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors, namely the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD), that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.
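
The pipeline named in the abstract (pointwise mutual information vectors, a cosine measure, and SVD-based dimensionality reduction) can be illustrated with a minimal sketch. This is not the authors' implementation: the toy count matrix, the function names, the clipping of negative PMI values to zero, and the choice to scale the left singular vectors by the singular values are all assumptions made for illustration only.

```python
import numpy as np

def ppmi_matrix(counts):
    """Convert a word-by-context count matrix to PMI values.

    Assumption: negative and undefined PMI values are set to zero
    (the abstract only says "pointwise mutual information values").
    """
    counts = np.asarray(counts, dtype=float)
    p_wc = counts / counts.sum()               # joint probabilities
    p_w = p_wc.sum(axis=1, keepdims=True)      # word marginals
    p_c = p_wc.sum(axis=0, keepdims=True)      # context marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0               # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)

def svd_reduce(matrix, k):
    """Keep the k largest singular components of the row vectors."""
    u, s, _ = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

# Hypothetical toy counts: rows are target words, columns are context words.
words = ["cat", "dog", "car"]
counts = [
    [10, 8, 0, 1],
    [9, 7, 1, 0],
    [0, 1, 12, 9],
]

vectors = ppmi_matrix(counts)
reduced = svd_reduce(vectors, 2)

sim_cat_dog = cosine_similarity(reduced[0], reduced[1])
sim_cat_car = cosine_similarity(reduced[0], reduced[2])
print(sim_cat_dog, sim_cat_car)
```

In this toy setting, words that share contexts ("cat" and "dog") end up with more similar vectors than words that do not ("cat" and "car"), both before and after the reduction. Real experiments of the kind the abstract describes would build the counts from a large corpus with a small co-occurrence window.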

Original language	English
Pages (from-to)	890-907
Journal	Behavior Research Methods
Volume	44
Issue number	3
DOIs	https://doi.org/10.3758/s13428-011-0183-8
Publication status	Published - 19 Jan 2012

Cite this

@article{f251cb421c014283b9af618f4388bfbc,

title = "Extracting Semantic Representations From Word Co-Occurrence Statistics: Stop-Lists, Stemming, and SVD",

abstract = "In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors, namely the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD), that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.",

author = "Bullinaria, John and Levy, {Joseph P}",

year = "2012",

month = jan,

day = "19",

doi = "10.3758/s13428-011-0183-8",

language = "English",

volume = "44",

pages = "890--907",

journal = "Behavior Research Methods",

issn = "1554-3528",

publisher = "Springer",

number = "3",

}

TY - JOUR

T1 - Extracting Semantic Representations From Word Co-Occurrence Statistics: Stop-Lists, Stemming, and SVD

AU - Bullinaria, John

AU - Levy, Joseph P

PY - 2012/1/19

Y1 - 2012/1/19

N2 - In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors, namely the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD), that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.

AB - In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors, namely the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD), that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.

U2 - 10.3758/s13428-011-0183-8

DO - 10.3758/s13428-011-0183-8

M3 - Article

C2 - 22258891

SN - 1554-3528

VL - 44

SP - 890

EP - 907

JO - Behavior Research Methods

JF - Behavior Research Methods

IS - 3

ER -
