TY - JOUR
T1 - Extracting Semantic Representations From Word Co-Occurrence Statistics: Stop-Lists, Stemming, and SVD
AU - Bullinaria, John
AU - Levy, Joseph P
PY - 2012/1/19
Y1 - 2012/1/19
N2 - In a previous article, we presented a systematic computational study of the extraction of semantic representations from the word-word co-occurrence statistics of large text corpora. The conclusion was that semantic vectors of pointwise mutual information values from very small co-occurrence windows, together with a cosine distance measure, consistently resulted in the best representations across a range of psychologically relevant semantic tasks. This article extends that study by investigating the use of three further factors, namely the application of stop-lists, word stemming, and dimensionality reduction using singular value decomposition (SVD), that have been used to provide improved performance elsewhere. It also introduces an additional semantic task and explores the advantages of using a much larger corpus. This leads to the discovery and analysis of improved SVD-based methods for generating semantic representations (that provide new state-of-the-art performance on a standard TOEFL task) and the identification and discussion of problems and misleading results that can arise without a full systematic study.
U2 - 10.3758/s13428-011-0183-8
DO - 10.3758/s13428-011-0183-8
M3 - Article
C2 - 22258891
SN - 1554-3528
VL - 44
SP - 890
EP - 907
JO - Behavior Research Methods
JF - Behavior Research Methods
IS - 3
ER -