Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus

Research output: Contribution to journal › Article › peer-review

Standard

Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus. / Smith, Catherine; Adolphs, Svenja; Harvey, Kevin; Mullany, Louise.

In: Corpora, Vol. 9, No. 2, 01.11.2014, p. 137-154.

Author

Smith, Catherine ; Adolphs, Svenja ; Harvey, Kevin ; Mullany, Louise. / Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus. In: Corpora. 2014 ; Vol. 9, No. 2. pp. 137-154.

Bibtex

@article{140383ecaa624610961b792989552734,
title = "Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus",
abstract = "The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, means that issues of non-standard spellings and spelling errors are, in future, likely to become more prominent for compilers of corpora. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, and the British National Corpus (BNC) acts as the reference corpus. The ranks of the keywords are shown to be very similar and, therefore, suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction.",
author = "Catherine Smith and Svenja Adolphs and Kevin Harvey and Louise Mullany",
year = "2014",
month = nov,
day = "1",
doi = "10.3366/cor.2014.0055",
language = "English",
volume = "9",
pages = "137--154",
journal = "Corpora",
issn = "1749-5032",
publisher = "Edinburgh University Press",
number = "2",
}

RIS

TY  - JOUR
T1  - Spelling errors and keywords in born-digital data: a case study using the Teenage Health Freak Corpus
AU  - Smith, Catherine
AU  - Adolphs, Svenja
AU  - Harvey, Kevin
AU  - Mullany, Louise
PY  - 2014/11/1
Y1  - 2014/11/1
N2  - The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, means that issues of non-standard spellings and spelling errors are, in future, likely to become more prominent for compilers of corpora. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, and the British National Corpus (BNC) acts as the reference corpus. The ranks of the keywords are shown to be very similar and, therefore, suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction.
AB  - The abundance of language data that is now available in digital form, and the rise of distinct language varieties that are used for digital communication, means that issues of non-standard spellings and spelling errors are, in future, likely to become more prominent for compilers of corpora. This paper examines the effect of spelling variation on keywords in a born-digital corpus in order to explore the extent and impact of this variation for future corpus studies. The corpus used in this study consists of e-mails about health concerns that were sent to a health website by adolescents. Keywords are generated using the original version of the corpus and a version with spelling errors corrected, and the British National Corpus (BNC) acts as the reference corpus. The ranks of the keywords are shown to be very similar and, therefore, suggest that, depending on the research goals, keywords could be generated reliably without any need for spelling correction.
U2  - 10.3366/cor.2014.0055
DO  - 10.3366/cor.2014.0055
M3  - Article
VL  - 9
SP  - 137
EP  - 154
JO  - Corpora
JF  - Corpora
SN  - 1749-5032
IS  - 2
ER  -